The Retry Storm
The Scenario
Your service just crashed. Within seconds, 10,000 clients detect the failure and start retrying. They all retry at exactly the same time, overwhelming your service as soon as it tries to recover. This is a Retry Storm - a cascading failure caused by synchronized retries.
The Problem
Your current retry logic uses exponential backoff, but the intervals are deterministic. When 10,000 clients all fail at the same moment, they all compute the same backoff sequence (1s, 2s, 4s...) and therefore retry simultaneously, creating thundering-herd waves that prevent recovery.
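To see why deterministic backoff keeps clients in lockstep, here is a minimal sketch. The function name fixed_backoff_schedule and its parameters are illustrative assumptions, not part of the existing retry logic.

```python
def fixed_backoff_schedule(base=1.0, retries=5):
    """Delays for plain exponential backoff with no randomness (hypothetical helper)."""
    return [base * 2 ** attempt for attempt in range(retries)]

# Two independent clients that failed at the same moment.
client_a = fixed_backoff_schedule()
client_b = fixed_backoff_schedule()

print(client_a)              # [1.0, 2.0, 4.0, 8.0, 16.0]
print(client_a == client_b)  # True: identical schedules, so retries arrive together
```

Because nothing in the calculation differs between clients, every retry wave hits the recovering service at the same instant.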
The Goal
Implement Full Jitter - randomized exponential backoff that spreads retries across time.
Instead of waiting exactly 4 seconds, wait a random time between 0 and 4 seconds. This breaks synchronization and gives your service breathing room.
Requirements:
- Use random.uniform(0, backoff) to add jitter
- Exponential backoff: base delay doubles each retry (1s, 2s, 4s, 8s...)
- Maximum retry attempts: 5
- Cap maximum backoff at 30 seconds
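One possible shape for a solution is sketched below. It is not a reference implementation: the names full_jitter_delay and call_with_retries, the injectable sleep parameter, and the broad except clause are all assumptions made for illustration. It also assumes that "5 retry attempts" means one initial call followed by up to 5 retries.

```python
import random
import time

MAX_RETRIES = 5      # retries after the initial call (assumed interpretation)
BASE_DELAY = 1.0     # seconds; doubles on each retry
MAX_BACKOFF = 30.0   # seconds; upper bound on the backoff window


def full_jitter_delay(attempt, base=BASE_DELAY, cap=MAX_BACKOFF):
    """Return a random delay in [0, min(cap, base * 2**attempt)]."""
    backoff = min(cap, base * 2 ** attempt)
    return random.uniform(0, backoff)


def call_with_retries(operation, max_retries=MAX_RETRIES, sleep=time.sleep):
    """Call operation(), retrying with full jitter on failure.

    `sleep` is injectable so the logic can be exercised without waiting.
    """
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:  # sketch only: real code should catch transient errors
            if attempt == max_retries:
                raise  # retries exhausted, surface the last error
            sleep(full_jitter_delay(attempt))
```

With a 1-second base and 5 retries, the backoff windows are 1s, 2s, 4s, 8s, and 16s, so the 30-second cap is never reached; it only matters if the base delay or the retry count is raised.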