
Retry Budgets in Go Services: Preventing Cascading Failure

Retries are useful until they become the reason your system is down. I have seen this pattern more than once: one dependency gets slower, callers retry aggressively, queue depth grows, CPU jumps, and the service that was already struggling now receives three times the normal traffic. The incident starts as a dependency issue and turns into a self-inflicted denial of service.

The fix is not to remove retries. The fix is to give retries a budget. A retry budget makes every caller spend from a limited allowance, so the system can absorb short failures without amplifying long ones.

The Failure Mode

Imagine a Go API that calls a reporting service. The reporting service usually responds in 80ms, but during a deploy its latency climbs to 900ms and beyond, so many calls hit the API's one-second timeout. The API retries twice, so every user request can now become three downstream calls, each waiting up to the full timeout before failing. If the original traffic is 200 requests per second, the downstream service may suddenly see up to 600 attempts per second while it is already unhealthy.

This is why retry logic belongs in architecture discussions, not just helper libraries. Retries change traffic shape.

A Simple Retry Budget

The simplest version I like is a token bucket per dependency. Each successful request puts one token back, up to a cap. Each retry spends one. When the bucket is empty, the caller fails fast or returns a degraded response.

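import "sync"

// RetryBudget is a token bucket shared by all callers of one dependency.
type RetryBudget struct {
    mu     sync.Mutex
    tokens int // retries currently available
    max    int // cap on stored tokens
}

// NewRetryBudget returns a budget that starts full with max tokens.
func NewRetryBudget(max int) *RetryBudget {
    return &RetryBudget{tokens: max, max: max}
}

// AllowRetry spends one token and reports whether the retry may proceed.
func (b *RetryBudget) AllowRetry() bool {
    b.mu.Lock()
    defer b.mu.Unlock()
    if b.tokens <= 0 {
        return false
    }
    b.tokens--
    return true
}

// RecordSuccess refills one token, never exceeding max.
func (b *RetryBudget) RecordSuccess() {
    b.mu.Lock()
    defer b.mu.Unlock()
    if b.tokens < b.max {
        b.tokens++
    }
}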

In production I usually do this with time windows and metrics instead of a plain counter, but the idea is the same: retries are not free. They are a shared resource.

Where the Budget Should Live

The budget should live close to the client of the dependency. A global retry budget for the whole application sounds nice, but it hides which dependency is failing. A per-dependency budget lets the Amazon Ads client degrade independently from the billing client, and lets the SQS consumer behave differently from the public HTTP API.

  • Use a separate budget per external dependency.
  • Use a separate budget per tenant or advertiser when one noisy account can dominate traffic.
  • Export remaining budget as a metric so incidents are visible before they become outages.
  • Do not retry validation errors, permission errors, or deterministic failures.
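The first three points can be sketched as a small registry that hands out one budget per dependency, optionally split per tenant, and exposes the remaining tokens for a metrics exporter to read. Budgets, BudgetKey, and TokenBudget are hypothetical names, and TokenBudget is a stripped-down stand-in for the RetryBudget type shown earlier.

```go
package main

import "sync"

// BudgetKey identifies one budget. Tenant is empty when the dependency's
// budget is not split per tenant.
type BudgetKey struct {
	Dependency string
	Tenant     string
}

// TokenBudget is a minimal stand-in for RetryBudget (illustrative only).
type TokenBudget struct {
	mu     sync.Mutex
	tokens int
}

// AllowRetry spends one token if any are left.
func (b *TokenBudget) AllowRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tokens <= 0 {
		return false
	}
	b.tokens--
	return true
}

// Remaining returns the tokens currently available.
func (b *TokenBudget) Remaining() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.tokens
}

// Budgets lazily creates one budget per key, each starting with max tokens.
type Budgets struct {
	mu    sync.Mutex
	max   int
	byKey map[BudgetKey]*TokenBudget
}

func NewBudgets(max int) *Budgets {
	return &Budgets{max: max, byKey: make(map[BudgetKey]*TokenBudget)}
}

// For returns the budget for a dependency and tenant, creating it on first use.
func (s *Budgets) For(dependency, tenant string) *TokenBudget {
	key := BudgetKey{Dependency: dependency, Tenant: tenant}
	s.mu.Lock()
	defer s.mu.Unlock()
	b, ok := s.byKey[key]
	if !ok {
		b = &TokenBudget{tokens: s.max}
		s.byKey[key] = b
	}
	return b
}

// Snapshot copies the remaining tokens per key, e.g. for a metrics exporter.
func (s *Budgets) Snapshot() map[BudgetKey]int {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[BudgetKey]int, len(s.byKey))
	for k, b := range s.byKey {
		out[k] = b.Remaining()
	}
	return out
}
```

A call such as budgets.For("amazon-ads", advertiserID).AllowRetry() then drains only that advertiser's allowance. One caveat of this sketch: the map grows with the number of distinct tenants, so a real version would likely need eviction.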

Backoff Still Matters

Retry budgets do not replace exponential backoff with jitter. They complement it. Backoff controls when retries happen. The budget controls whether they should happen at all.

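import "math/rand"
import "time"

// retryDelay returns a jittered delay in [d/2, d), where d doubles per attempt
// up to max. The attempt is clamped because a negative shift panics and a
// large one overflows, which would make rand.Int63n panic on n <= 0.
func retryDelay(attempt int) time.Duration {
    base := 100 * time.Millisecond
    max := 2 * time.Second
    if attempt < 0 {
        attempt = 0
    } else if attempt > 10 {
        attempt = 10
    }
    d := base * time.Duration(1<<attempt)
    if d > max {
        d = max
    }
    jitter := time.Duration(rand.Int63n(int64(d / 2)))
    return d/2 + jitter
}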

The Production Rule

My rule now is simple: every retry policy must answer three questions. What errors are retryable? How long can the caller wait? What stops retries when the dependency is already burning? If the answer to the third question is missing, the retry policy is incomplete.