Resilience

Resilience is the ability to keep serving even when components fail. It is not about preventing every failure; it is about handling failures gracefully.


The Three Pillars

1. Prevention:   Stop failures from happening
2. Detection:    Find failures fast when they happen
3. Recovery:     Recover from failures automatically

Most effort goes to detection and recovery, because preventing every failure is impossible at scale.


Core Patterns

1. Timeouts

Every call to an external service must have a timeout. Without one, a slow or dead dependency can hold the request open indefinitely.

import requests

# ❌ No timeout: the request can hang indefinitely
response = requests.get("https://api.example.com/data")

# ✅ With timeout (seconds; applies to connecting and to each read, not to the whole request)
response = requests.get("https://api.example.com/data", timeout=3)

Rule: Set each timeout near the 99th percentile of that call's observed latency, not at an arbitrary value.
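One way to apply this rule is to compute the percentile from recorded latencies. The sketch below uses only the standard library; the function name `p99_timeout` and the sample data are hypothetical, and a real system would read latencies from its metrics store.

```python
import statistics

def p99_timeout(latencies_s):
    """Return the 99th-percentile latency (seconds) to use as a timeout."""
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    return statistics.quantiles(latencies_s, n=100, method="inclusive")[98]

# Synthetic sample: most requests are fast, a few are slow.
samples = [0.1] * 950 + [0.5] * 40 + [2.0] * 10
timeout = p99_timeout(samples)

# The value can then be passed to the client, e.g.
# requests.get("https://api.example.com/data", timeout=timeout)
```

Recomputing the value periodically keeps the timeout in line with how the dependency actually behaves.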

2. Circuit Breakers

CLOSED (normal):       requests pass through
  ↓ failure threshold
OPEN (failing):        requests rejected immediately (fallback)
  ↓ recovery timeout
HALF-OPEN (testing):   one probe request to test recovery
  ↓ success            ↓ failure
CLOSED                 OPEN

from circuitbreaker import circuit

# Opens after 5 consecutive failures; allows a probe call after 30 seconds
@circuit(failure_threshold=5, recovery_timeout=30)
def call_external_service():
    return external_api.get("/data")  # external_api: the application's own client
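To make the state transitions concrete, here is a minimal hand-rolled sketch of the same state machine. It is not how the `circuitbreaker` package is implemented; the class name `SimpleCircuitBreaker`, the `CircuitOpenError` exception, and the injectable `clock` are assumptions for illustration, and the class is not thread-safe.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""

class SimpleCircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock            # injectable so tests do not need to sleep
        self.state = "CLOSED"
        self.failures = 0             # consecutive failures seen so far
        self.opened_at = None

    def call(self, func):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            self.state = "HALF_OPEN"  # recovery timeout elapsed: allow one probe
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed probe, or too many failures while closed, opens the circuit.
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.state = "CLOSED"
        self.failures = 0
        return result
```

Callers wrap each outbound call as `breaker.call(lambda: external_api.get("/data"))` and serve a fallback when `CircuitOpenError` is raised.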

3. Retry with Exponential Backoff

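import random
import time

class RetryableError(Exception):
    """Transient failure that is safe to retry (e.g. a timeout or HTTP 503)."""

def retry_with_backoff(func, max_attempts=3, base_delay=1):
    for attempt in range(max_attempts):
        try:
            return func()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of random jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)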

Key: Add jitter (randomization) to prevent a thundering herd, where every client retries at the same moment.

Without jitter:  all clients wait exactly 1s, 2s, 4s... and retry in lockstep
With jitter:     each client waits a random 1-2s, 2-3s, 4-5s... so retries spread out
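The effect is easy to see by printing the delay schedule for two clients. This sketch reuses the additive jitter from the retry function above; `backoff_delays` and the seeded generators are illustrative helpers, not part of any library.

```python
import random

def backoff_delays(max_attempts=4, base_delay=1.0, jitter=True, rng=random):
    """Return the sleep before each retry (one fewer than max_attempts)."""
    delays = []
    for attempt in range(max_attempts - 1):
        delay = base_delay * (2 ** attempt)
        if jitter:
            delay += rng.uniform(0, 1)  # up to 1s of additive jitter
        delays.append(delay)
    return delays

no_jitter = backoff_delays(jitter=False)          # identical for every client
client_a = backoff_delays(rng=random.Random(1))   # seeded only to make the demo repeatable
client_b = backoff_delays(rng=random.Random(2))

print("no jitter:", no_jitter)
print("client A: ", [round(d, 2) for d in client_a])
print("client B: ", [round(d, 2) for d in client_b])
```

Other jitter strategies exist (for example drawing the whole delay at random between zero and the exponential cap); which one fits depends on how tightly clients are synchronized.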

4. Bulkheads

Isolate failures so one component’s failure doesn’t cascade.

Traditional (shared pool):     Bulkhead (isolated pools):
┌──────────────────────┐       ┌────────┐ ┌────────┐ ┌────────┐
│     Thread Pool      │       │ Pool A │ │ Pool B │ │ Pool C │
│ (all services share) │       │ (svc A)│ │ (svc B)│ │ (svc C)│
└──────────────────────┘       └────────┘ └────────┘ └────────┘
svc A overload ──▶ all fail    svc A overload ──▶ only A fails

In K8s: separate Deployment per service with its own resource limits.
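Inside a single process, the same isolation can be sketched with one thread pool per dependency. The service names and pool sizes below are hypothetical; the point is only that exhausting one pool leaves the others free.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One small, dedicated pool per downstream dependency (names are hypothetical).
pools = {
    "svc_a": ThreadPoolExecutor(max_workers=2, thread_name_prefix="svc_a"),
    "svc_b": ThreadPoolExecutor(max_workers=2, thread_name_prefix="svc_b"),
}

def call_via_bulkhead(service, func, *args):
    """Run func on the pool reserved for `service` and return a Future."""
    return pools[service].submit(func, *args)

# Simulate svc A hanging: both of its workers block on an event.
release = threading.Event()
stuck = [call_via_bulkhead("svc_a", release.wait) for _ in range(2)]

# svc B has its own workers, so its call still completes promptly.
result_b = call_via_bulkhead("svc_b", lambda: "b ok").result(timeout=1)

release.set()  # let the simulated svc A calls finish
for pool in pools.values():
    pool.shutdown(wait=True)
```

With a single shared pool of the same total size, the two blocked svc A calls could occupy workers that svc B needs.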

5. Graceful Degradation

When a dependency fails, serve a useful fallback instead of a hard error.

def get_product_detail(product_id):
    try:
        # Primary: from recommendation engine
        return recommendation_engine.get(product_id)
    except ServiceUnavailable:
        pass  # fall through to the fallbacks below
    try:
        # Fallback: from static cache
        return static_cache.get(product_id)
    except Exception:
        # Last resort: return minimal data (also covers a failing cache)
        return {"product_id": product_id, "name": "Default Product"}

6. Health Checks

# Kubernetes-style health endpoints
# Assumes Flask 2.0+; db and redis stand for the application's own client objects
from flask import Flask

app = Flask(__name__)

@app.get("/health/live")
def liveness():
    return {"status": "ok"}  # I'm alive

@app.get("/health/ready")
def readiness():
    if not db.is_connected():
        return {"status": "not_ready", "reason": "db_disconnected"}, 503
    if not redis.is_connected():
        return {"status": "not_ready", "reason": "cache_disconnected"}, 503
    return {"status": "ready"}

Check            Purpose                   LB Removes Instance?
/health/live     Process is alive          No (never kill)
/health/ready    Ready to serve traffic    Yes

Chaos Engineering

Test resilience by deliberately breaking things in staging.

Game Day: intentionally kill a service, then verify that alarms fire and the runbook works as written.

Principles:

  1. Blast radius: start small (1 pod, 1 AZ)
  2. Hypothesis: “we expect X to happen”
  3. Measure: did the system behave as expected?
  4. Automate: repeat in CI
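The hypothesis and measure steps can be rehearsed in miniature before touching real infrastructure. The sketch below injects faults into a stand-in dependency and checks one hypothesis; every name in it (`FlakyDependency`, `handle_request`, `run_experiment`) is hypothetical, and real experiments would use one of the tools listed below against staging.

```python
import random

class FlakyDependency:
    """Stand-in for a real service; fails a configurable fraction of calls."""
    def __init__(self, failure_rate=0.0, rng=None):
        self.failure_rate = failure_rate
        self.rng = rng or random.Random(0)  # seeded so runs are repeatable

    def get(self):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return "fresh"

def handle_request(dep):
    """System under test: falls back to a cached value on failure."""
    try:
        return dep.get()
    except ConnectionError:
        return "cached"

def run_experiment(failure_rate, n_requests=1000):
    dep = FlakyDependency(failure_rate)
    responses = [handle_request(dep) for _ in range(n_requests)]
    return {
        "served": len(responses) / n_requests,              # fraction answered at all
        "degraded": responses.count("cached") / n_requests,  # fraction served from fallback
    }

# Hypothesis: with 30% injected faults, every request is still served.
report = run_experiment(failure_rate=0.3)
hypothesis_holds = report["served"] == 1.0
print(report, "hypothesis holds:", hypothesis_holds)
```

Because the check is an ordinary function, it can run in CI, which is one way to satisfy the automation principle.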

Tools

Tool                             What It Breaks
Chaos Monkey (Netflix)           Random service kill
Gremlin                          CPU, memory, network, I/O
Litmus                           K8s resources
kube-monkey                      K8s pod kill
AWS Fault Injection Simulator    AWS resources

Reliability vs Availability

Property        Definition                                          What It Measures
Reliability     Probability the system works correctly over time    “Did we serve the right answer?”
Availability    Proportion of time the system is operational        “Is the system up?”

Reliable but not available:  right answers, but slowly or not at all
Available but not reliable:  always up, but wrong answers fast
Both:                        right answers, fast, always
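A small worked example shows that the two numbers are computed from different data. The figures are invented, and using the fraction of correct responses as a stand-in for reliability is a simplifying assumption.

```python
def availability(uptime_s, total_s):
    """Proportion of the period during which the system was operational."""
    return uptime_s / total_s

def reliability(correct_responses, total_responses):
    """Fraction of served responses that were correct (a simple proxy)."""
    return correct_responses / total_responses

# Hypothetical 30-day month with 43 minutes of downtime.
month_s = 30 * 24 * 3600
avail = availability(month_s - 43 * 60, month_s)

# Hypothetical traffic: 1,000,000 responses, 500 of them wrong.
rel = reliability(1_000_000 - 500, 1_000_000)

print(f"availability: {avail:.4%}")
print(f"reliability:  {rel:.4%}")
```

A system can score well on one measure and poorly on the other, which is why both are tracked.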

Quick Checklist

□ All external calls have timeouts
□ Circuit breakers on all dependencies
□ Retries with exponential backoff + jitter
□ Bulkhead isolation (separate thread pools / deployments)
□ Graceful degradation for non-critical dependencies
□ Health endpoints: /health/live and /health/ready
□ Chaos engineering in staging (game days)
□ Runbook for every failure scenario
□ Observability: latency percentiles, error rates, circuit state
