Your System Will Fail. The Question Is How Gracefully.
I was on call when our payment service went down because a third-party fraud detection API started timing out. The timeout was 30 seconds. We had 200 concurrent requests, each holding a thread while waiting for a response that would never come. Within two minutes, the thread pool was exhausted, and our entire payment service — not just the fraud check — was unresponsive. Orders, refunds, balance queries — everything dead because one downstream dependency got slow.
This is the canonical failure mode that resilience patterns exist to prevent. Not the dramatic server-on-fire scenario, but the quiet, cascading kind where one slow dependency drags everything down with it. Circuit breakers, bulkheads, and retry patterns are the engineering tools that contain these failures before they become full outages.
Circuit Breakers: Stop Calling a Dead Service
A circuit breaker monitors calls to a downstream service and trips open when failures exceed a threshold. Once open, it immediately fails all requests without actually calling the downstream service, giving it time to recover.
The circuit breaker has three states:
- Closed: Normal operation. Requests flow through. Failures are counted.
- Open: Failures exceeded the threshold. All requests are immediately rejected without calling the downstream service.
- Half-Open: After a cooldown period, a limited number of test requests are allowed through. If they succeed, the breaker closes. If they fail, it opens again.
Implementation in Python
```python
import time
from dataclasses import dataclass, field
from enum import Enum
from threading import Lock

import httpx


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    pass


@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    recovery_timeout: float = 30.0
    half_open_max_calls: int = 3
    _state: CircuitState = field(default=CircuitState.CLOSED, init=False)
    _failure_count: int = field(default=0, init=False)
    _last_failure_time: float = field(default=0.0, init=False)
    _half_open_calls: int = field(default=0, init=False)
    _lock: Lock = field(default_factory=Lock, init=False)

    @property
    def state(self) -> CircuitState:
        with self._lock:
            if self._state == CircuitState.OPEN:
                if time.time() - self._last_failure_time > self.recovery_timeout:
                    self._state = CircuitState.HALF_OPEN
                    self._half_open_calls = 0
            return self._state

    def record_success(self):
        with self._lock:
            if self._state == CircuitState.HALF_OPEN:
                self._half_open_calls += 1
                if self._half_open_calls >= self.half_open_max_calls:
                    self._state = CircuitState.CLOSED
                    self._failure_count = 0
            else:
                self._failure_count = 0

    def record_failure(self):
        with self._lock:
            self._failure_count += 1
            self._last_failure_time = time.time()
            if self._state == CircuitState.HALF_OPEN:
                # A failed probe sends the breaker straight back to open
                self._state = CircuitState.OPEN
            elif self._failure_count >= self.failure_threshold:
                self._state = CircuitState.OPEN

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            raise CircuitOpenError(
                f"Circuit is open. Will retry after {self.recovery_timeout}s. "
                f"Last failure: {time.time() - self._last_failure_time:.1f}s ago."
            )
        try:
            result = func(*args, **kwargs)
            self.record_success()
            return result
        except Exception:
            self.record_failure()
            raise


# Usage
fraud_check_breaker = CircuitBreaker(
    failure_threshold=5,
    recovery_timeout=30.0,
)

async def check_fraud(transaction):
    # Note: httpx.post is synchronous and will block the event loop here.
    # A production version would run it in a thread pool or use an
    # async-aware breaker around httpx.AsyncClient.
    try:
        return fraud_check_breaker.call(
            httpx.post,
            "https://fraud-api.example.com/check",
            json=transaction.dict(),
            timeout=5.0,
        )
    except CircuitOpenError:
        # Fallback: allow the transaction but flag for manual review
        return FraudResult(approved=True, requires_review=True)
```
Circuit Breakers in Practice
The fallback behavior when the circuit is open is where the real engineering judgment lives. Options include:
| Strategy | When to Use | Example |
|---|---|---|
| Return cached data | Data staleness is acceptable | Product catalog, user preferences |
| Return a default | A safe default exists | Default shipping estimate, feature flags off |
| Degrade gracefully | Feature is optional | Skip recommendations, skip analytics |
| Fail fast with clear error | No safe fallback exists | Payment processing, auth checks |
| Queue for later | Action can be async | Email notifications, webhook delivery |
Bulkheads: Isolate the Blast Radius
The bulkhead pattern borrows from shipbuilding: ships have watertight compartments so that a hull breach in one compartment does not sink the entire vessel. In software, bulkheads isolate resources so that a failure in one component cannot exhaust resources needed by other components.
Thread Pool Bulkheads
The most common bulkhead implementation uses separate thread pools (or connection pools, or semaphores) for different downstream dependencies:
```python
import asyncio
from dataclasses import dataclass


class BulkheadFullError(Exception):
    pass


@dataclass
class Bulkhead:
    name: str
    max_concurrent: int
    max_wait: float = 5.0

    def __post_init__(self):
        self._semaphore = asyncio.Semaphore(self.max_concurrent)
        self._waiting = 0

    async def execute(self, coro):
        self._waiting += 1
        try:
            await asyncio.wait_for(
                self._semaphore.acquire(),
                timeout=self.max_wait,
            )
        except asyncio.TimeoutError:
            self._waiting -= 1
            raise BulkheadFullError(
                f"Bulkhead '{self.name}' is full. "
                f"{self.max_concurrent} calls in progress, "
                f"{self._waiting} waiting."
            )
        self._waiting -= 1
        try:
            return await coro
        finally:
            self._semaphore.release()


# Separate bulkheads for each downstream service
payment_bulkhead = Bulkhead("payment-api", max_concurrent=20, max_wait=5.0)
fraud_bulkhead = Bulkhead("fraud-api", max_concurrent=10, max_wait=3.0)
inventory_bulkhead = Bulkhead("inventory-api", max_concurrent=30, max_wait=5.0)

async def process_order(order):
    # Each call is isolated. If the fraud API is slow and all 10 slots
    # are occupied, it cannot steal capacity from payment or inventory.
    payment = await payment_bulkhead.execute(
        check_payment(order.payment_method)
    )
    fraud = await fraud_bulkhead.execute(
        check_fraud(order)
    )
    inventory = await inventory_bulkhead.execute(
        reserve_inventory(order.items)
    )
```
Without bulkheads, all downstream calls share a single resource pool. When one service gets slow, it monopolizes the shared resources and every other service call suffers. With bulkheads, a slow fraud API can only consume its allocated 10 concurrent slots. The remaining capacity for payments and inventory remains untouched.
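A quick way to convince yourself the isolation works is a toy simulation: hang a call inside one "bulkhead" and watch the other stay available. This models each bulkhead as a bare `asyncio.Semaphore` with one slot; the names and timings are invented for the demo:

```python
import asyncio

async def main():
    # Two tiny "bulkheads": one slot each, modeled as plain semaphores
    fraud_slots = asyncio.Semaphore(1)
    payment_slots = asyncio.Semaphore(1)

    async def hung_fraud_call():
        async with fraud_slots:
            await asyncio.sleep(10)  # the fraud API has stopped responding

    async def payment_call():
        async with payment_slots:
            return "payment ok"

    hung = asyncio.create_task(hung_fraud_call())
    await asyncio.sleep(0)  # let the hung call grab its slot

    # The fraud bulkhead is exhausted: acquiring a slot times out...
    try:
        await asyncio.wait_for(fraud_slots.acquire(), timeout=0.1)
        fraud_available = True
        fraud_slots.release()
    except asyncio.TimeoutError:
        fraud_available = False

    # ...but the payment bulkhead is completely unaffected
    result = await payment_call()

    hung.cancel()
    try:
        await hung
    except asyncio.CancelledError:
        pass
    return fraud_available, result

fraud_available, result = asyncio.run(main())
print(fraud_available, result)  # False payment ok
```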
Retry Patterns: The Most Dangerous Tool in Your Toolbox
Retries are the most commonly implemented and most commonly misimplemented resilience pattern. A naive retry loop can turn a minor hiccup into a catastrophic retry storm that overwhelms the very service you are trying to reach.
The Wrong Way
```python
# NEVER DO THIS
async def call_service(url, payload):
    for attempt in range(5):
        try:
            async with httpx.AsyncClient() as client:
                response = await client.post(url, json=payload, timeout=10)
                response.raise_for_status()
                return response
        except Exception:
            pass  # Retry immediately
    raise Exception("Service unavailable")

# Why this is dangerous:
# - No backoff: hammers the failing service as fast as possible
# - No jitter: all clients retry at the exact same time
# - Retries on ALL exceptions, including 400 Bad Request
# - 5 retries * N clients = 5N requests to an already struggling service
```
The Right Way: Exponential Backoff with Jitter
```python
import asyncio
import random

import httpx


class NonRetryableError(Exception):
    pass

class RetriesExhaustedError(Exception):
    pass

class HttpError(Exception):
    def __init__(self, status_code, body):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code
        self.body = body


async def call_with_retry(
    url: str,
    payload: dict,
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retryable_status_codes: frozenset = frozenset({429, 502, 503, 504}),
):
    last_exception = None
    async with httpx.AsyncClient() as client:
        for attempt in range(max_retries + 1):
            try:
                response = await client.post(url, json=payload, timeout=5.0)
                if response.status_code < 400:
                    return response
                if response.status_code not in retryable_status_codes:
                    # Client error (4xx) - do NOT retry
                    raise NonRetryableError(
                        f"Request failed with {response.status_code}: {response.text}"
                    )
                last_exception = HttpError(response.status_code, response.text)
            except (httpx.ConnectTimeout, httpx.ReadTimeout, httpx.ConnectError) as e:
                last_exception = e
            if attempt < max_retries:
                # Exponential backoff with full jitter
                delay = min(base_delay * (2 ** attempt), max_delay)
                await asyncio.sleep(random.uniform(0, delay))
    raise RetriesExhaustedError(
        f"Failed after {max_retries + 1} attempts. Last error: {last_exception}"
    )
```
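To see what full jitter actually produces, sample the delay schedule directly. The bounds here follow from the `min(base_delay * 2 ** attempt, max_delay)` cap; the numbers are illustrative:

```python
import random

def backoff_delay(attempt, base_delay=1.0, max_delay=30.0):
    """Full jitter: pick uniformly from [0, capped exponential]."""
    cap = min(base_delay * (2 ** attempt), max_delay)
    return random.uniform(0, cap)

# The ceiling doubles each attempt until it hits max_delay:
caps = [min(1.0 * 2 ** a, 30.0) for a in range(6)]
print(caps)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]

# Sampled delays always stay within the attempt's ceiling
for attempt in range(6):
    for _ in range(100):
        d = backoff_delay(attempt)
        assert 0 <= d <= caps[attempt]
```

Picking uniformly from zero up to the ceiling, rather than around it, is what spreads clients out so they do not all retry in the same instant.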
Retry Budget Pattern
An even better approach is a retry budget that limits retries as a percentage of total traffic:
```python
from collections import deque
from time import time


class RetryBudget:
    """Limits retries to a percentage of total requests over a time window."""

    def __init__(self, max_retry_ratio=0.1, window_seconds=60, min_retries_per_second=10):
        self.max_retry_ratio = max_retry_ratio
        self.window_seconds = window_seconds
        self.min_retries_per_second = min_retries_per_second
        self._requests = deque()
        self._retries = deque()

    def _cleanup(self):
        cutoff = time() - self.window_seconds
        while self._requests and self._requests[0] < cutoff:
            self._requests.popleft()
        while self._retries and self._retries[0] < cutoff:
            self._retries.popleft()

    def record_request(self):
        self._requests.append(time())

    def record_retry(self):
        self._retries.append(time())

    def can_retry(self) -> bool:
        self._cleanup()
        total_requests = len(self._requests)
        total_retries = len(self._retries)
        # Always allow a minimum retry rate
        if total_retries < self.min_retries_per_second * self.window_seconds:
            return True
        # Check if retries exceed the budget
        if total_requests == 0:
            return True
        return (total_retries / total_requests) < self.max_retry_ratio


# Usage: retry budget shared across all callers of a service
payment_retry_budget = RetryBudget(max_retry_ratio=0.1)  # Max 10% retries
```
Combining the Patterns
These patterns work best together. Here is how they compose:
```python
async def resilient_call(service_name, url, payload):
    """
    Call flow:
    1. Check circuit breaker (fail fast if open)
    2. Acquire bulkhead slot (fail if capacity exhausted)
    3. Make the call with retry logic
    4. Record result in circuit breaker
    """
    # Assumes per-service registries (dicts) populated at startup
    breaker = circuit_breakers[service_name]
    bulkhead = bulkheads[service_name]
    budget = retry_budgets[service_name]

    # Step 1: Circuit breaker check
    if breaker.state == CircuitState.OPEN:
        return get_fallback(service_name, payload)

    # Steps 2 and 3: Bulkhead wrapped around the retrying call
    budget.record_request()
    try:
        result = await bulkhead.execute(
            call_with_retry(
                url, payload,
                max_retries=2 if budget.can_retry() else 0,
            )
        )
        breaker.record_success()
        return result
    except Exception:
        breaker.record_failure()
        raise
```
Library Recommendations
You do not have to implement these patterns from scratch. Production-grade libraries exist for most languages:
| Language | Library | Patterns Supported |
|---|---|---|
| Java/Kotlin | Resilience4j | Circuit breaker, bulkhead, retry, rate limiter, time limiter |
| Go | sony/gobreaker | Circuit breaker |
| Go | avast/retry-go | Retry with backoff |
| Python | tenacity | Retry with backoff, jitter |
| Python | pybreaker | Circuit breaker |
| Node.js | cockatiel | Circuit breaker, bulkhead, retry, timeout |
| .NET | Polly | Circuit breaker, bulkhead, retry, timeout, fallback |
Resilience patterns are not optional complexity you add when things get serious. They are the difference between a minor dependency hiccup and a two-hour outage that wakes up the entire on-call rotation. Start with circuit breakers on your slowest downstream dependency. Add bulkheads when you have more than three downstream services. Implement retry budgets before your retry loops amplify the next incident. Your future self at 3 AM will be grateful.
