Retry Budgets in Go Services: Preventing Cascading Failure
Retries are useful until they become the reason your system is down. I have seen this pattern more than once: one dependency gets slower, callers retry aggressively, queue depth grows, CPU jumps, and the service that was already struggling now receives three times the normal traffic. The incident starts as a dependency issue and turns into a self-inflicted denial of service.

The fix is not to remove retries. The fix is to give retries a budget. A retry budget makes every caller spend from a limited allowance, so the system can absorb short failures without amplifying long ones.

The Failure Mode

Imagine a Go API that calls a reporting service. The reporting service usually responds in 80ms, but during a deploy it starts taking 900ms. The API has a one-second timeout and retries twice. Every user request can now become three downstream calls, and each one waits almost the full timeout before failing. If the original traffic is 200 requests per second, the downstream service may sudde...