Resilience

Resilience = the ability to keep serving even when components fail. Not about preventing failures — about handling them gracefully.

The Three Pillars

1. Prevention:   Stop failures from happening
2. Detection:    Find failures fast when they happen
3. Recovery:     Recover from failures automatically

Most effort goes to detection and recovery, because preventing every failure is impossible at scale.

Core Patterns

1. Timeouts

Every call to an external service must have a timeout. No timeout = the request can hang indefinitely.

import requests

# ❌ No timeout: the request can hang indefinitely
response = requests.get("https://api.example.com/data")

# ✅ With a timeout in seconds; raises requests.exceptions.Timeout when exceeded
response = requests.get("https://api.example.com/data", timeout=3)

Rule: Derive timeouts from measured latency (around the 99th percentile), not from arbitrary values.
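One way to turn the rule above into a number is to compute the percentile from recorded latencies. The following is a minimal sketch, not a prescribed method: the function name `timeout_from_samples`, the `headroom` multiplier, and the sample data are all assumptions made for illustration.

```python
import statistics

def timeout_from_samples(latencies_s, percentile=99, headroom=1.2):
    """Return a timeout in seconds: a latency percentile times a headroom factor."""
    # With n=100, statistics.quantiles returns the 99 cut points p1..p99.
    cut_points = statistics.quantiles(latencies_s, n=100, method="inclusive")
    return cut_points[percentile - 1] * headroom

# Hypothetical measurements: most calls are fast, a few are slow.
samples = [0.12] * 95 + [0.40, 0.55, 0.80, 1.50, 2.90]
timeout_s = timeout_from_samples(samples)
print(f"suggested timeout: {timeout_s:.2f}s")
```

The headroom factor is a judgment call: it keeps requests that land just above the p99 from being cut off, at the cost of waiting slightly longer on genuinely stuck calls.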

2. Circuit Breakers

CLOSED (normal):       requests pass through
  ↓ failure threshold reached
OPEN (failing):        requests rejected immediately (fallback)
  ↓ recovery timeout elapsed
HALF-OPEN (testing):   one probe request to test recovery
  ↓ success            ↓ failure
CLOSED                 OPEN

from circuitbreaker import circuit
 
@circuit(failure_threshold=5, recovery_timeout=30)
def call_external_service():
    return external_api.get("/data")
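To make the state diagram above concrete, here is a minimal hand-rolled sketch of the three states. It is not how the `circuitbreaker` library is implemented; the class `SimpleCircuitBreaker`, the `CircuitOpenError` exception, and the injectable `clock` parameter are hypothetical names chosen for illustration, and the sketch is not thread-safe.

```python
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class SimpleCircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock  # injectable so tests do not need to sleep
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("rejected without calling the dependency")
            self.state = "HALF-OPEN"  # recovery timeout elapsed: allow one probe
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed probe, or too many consecutive failures, opens the circuit.
            if self.state == "HALF-OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.state = "CLOSED"
        self.failures = 0
        return result
```

A caller would typically catch `CircuitOpenError` and serve a fallback, which is what makes the OPEN state cheap: the failing dependency is not contacted at all.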

3. Retry with Exponential Backoff

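import random
import time

class RetryableError(Exception):
    """Transient failure that is safe to retry (e.g. a timeout or HTTP 503)."""

def retry_with_backoff(func, max_attempts=3, base_delay=1):
    for attempt in range(max_attempts):
        try:
            return func()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of random jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)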

Key: Add jitter (randomization) to prevent a thundering herd.

Without jitter:  all clients wait exactly 1s, 2s, 4s... and retry in lockstep
With jitter:     waits are spread across 1-2s, 2-3s, 4-5s...
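The delay schedule can be inspected without sleeping. This sketch reuses the formula from `retry_with_backoff` (`base_delay * 2 ** attempt` plus uniform jitter); the helper name `backoff_delays` and its `jitter` and `rng` parameters are assumptions added for illustration.

```python
import random

def backoff_delays(max_attempts=3, base_delay=1.0, jitter=1.0, rng=random):
    """Delays in seconds slept before each retry: base_delay * 2**attempt + jitter."""
    return [
        base_delay * (2 ** attempt) + rng.uniform(0, jitter)
        for attempt in range(max_attempts - 1)  # no sleep after the final attempt
    ]

# Without jitter every client computes the same schedule.
print(backoff_delays(max_attempts=4, jitter=0))  # [1.0, 2.0, 4.0]

# With jitter two clients are very unlikely to pick the same delays.
print(backoff_delays(max_attempts=4, rng=random.Random(1)))
print(backoff_delays(max_attempts=4, rng=random.Random(2)))
```

Additive jitter of up to one second spreads clients less and less as the base delay grows; scaling the jitter with the delay is a common alternative when many clients retry at once.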

4. Bulkheads

Isolate failures so one component’s failure doesn’t cascade.

Traditional (shared pool):        Bulkhead (isolated pools):
┌───────────────────────┐         ┌─────────┐ ┌─────────┐ ┌─────────┐
│      Thread Pool      │         │ Pool A  │ │ Pool B  │ │ Pool C  │
│ (all services share)  │         │ (svc A) │ │ (svc B) │ │ (svc C) │
└───────────────────────┘         └─────────┘ └─────────┘ └─────────┘
svc A overload ──▶ all svcs fail  svc A overload ──▶ only A fails

In K8s: separate Deployment per service with its own resource limits.
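Inside a single process, the isolated-pools picture above can be approximated with one bounded thread pool per dependency. This is a minimal sketch under assumptions: the service names, the pool sizes, and the `submit` helper are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One bounded pool per downstream service (names and sizes are hypothetical).
POOLS = {
    "svc_a": ThreadPoolExecutor(max_workers=4, thread_name_prefix="svc_a"),
    "svc_b": ThreadPoolExecutor(max_workers=4, thread_name_prefix="svc_b"),
}

def submit(service, func, *args):
    """Run func on the pool reserved for `service`; other pools are unaffected."""
    return POOLS[service].submit(func, *args)

# Demo: saturate svc_a with calls that hang, then show svc_b still responds.
gate = threading.Event()
stuck = [submit("svc_a", gate.wait) for _ in range(8)]
healthy = submit("svc_b", lambda: "ok")
print(healthy.result(timeout=2))  # svc_b answers while svc_a is saturated
gate.set()  # release the hung svc_a calls so the demo can exit
```

With a single shared pool of four workers, the eight hung `svc_a` calls would occupy every thread and the `svc_b` call would queue behind them.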

5. Graceful Degradation

When a dependency fails, serve a useful fallback instead of a hard error.

def get_product_detail(product_id):
    try:
        # Primary: from recommendation engine
        return recommendation_engine.get(product_id)
    except ServiceUnavailable:
        try:
            # Fallback: from static cache
            return static_cache.get(product_id)
        except Exception:
            pass  # cache failed too: fall through to the last resort
    except Exception:
        pass  # unexpected primary failure: skip straight to the last resort
    # Last resort: return minimal data
    return {"product_id": product_id, "name": "Default Product"}

6. Health Checks

# Kubernetes-style health endpoints
@app.get("/health/live")
def liveness():
    return {"status": "ok"}  # I'm alive
 
@app.get("/health/ready")
def readiness():
    if not db.is_connected():
        return {"status": "not_ready", "reason": "db_disconnected"}, 503
    if not redis.is_connected():
        return {"status": "not_ready", "reason": "cache_disconnected"}, 503
    return {"status": "ready"}

Check	Purpose	LB Removes Instance?
`/health/live`	Process is alive	No (on failure, the kubelet restarts the container instead)
`/health/ready`	Ready to serve traffic	Yes (until the check passes again)

Chaos Engineering

Test resilience by deliberately breaking things in staging.

Game Day: intentionally kill a service, then verify that alarms fire and the runbook works.

Principles:

Blast radius: start small (1 pod, 1 AZ)
Hypothesis: “we expect X to happen”
Measure: did the system behave as expected?
Automate: repeat in CI
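The four principles above can be rehearsed in miniature before touching real infrastructure. The sketch below injects failures into a single function (small blast radius), states a hypothesis, measures the outcome, and uses a seeded random generator so it can run in CI. Every name in it (`InjectedFault`, `with_fault_injection`, `fetch_price`, `get_price`, `run_experiment`) is hypothetical, and it stands in for, rather than replaces, the tools listed below.

```python
import random

class InjectedFault(Exception):
    """Failure injected by the experiment, not a real outage."""

def with_fault_injection(func, failure_rate, rng):
    """Wrap func so a fraction of calls fail (blast radius: this one function)."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("chaos experiment")
        return func(*args, **kwargs)
    return wrapper

def fetch_price(product_id):
    """Hypothetical dependency call."""
    return {"product_id": product_id, "price": 9.99}

def get_price(product_id, fetch):
    """Code under test: falls back to a degraded response when fetch fails."""
    try:
        return fetch(product_id)
    except Exception:
        return {"product_id": product_id, "price": None}

def run_experiment(calls=1000, failure_rate=0.3, seed=42):
    rng = random.Random(seed)  # seeded so the run is repeatable in CI
    flaky_fetch = with_fault_injection(fetch_price, failure_rate, rng)
    responses = [get_price(i, flaky_fetch) for i in range(calls)]
    degraded = sum(1 for r in responses if r["price"] is None)
    return {"answered": len(responses) / calls, "degraded": degraded / calls}

# Hypothesis: every request is answered even with 30% injected failures.
result = run_experiment()
assert result["answered"] == 1.0, "hypothesis rejected: some requests were dropped"
print(result)
```

If the fallback in `get_price` were removed, the injected fault would propagate and the experiment would fail loudly, which is the signal a game day is meant to produce.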

Tools

Tool	What It Breaks
Chaos Monkey (Netflix)	Random instance termination
Gremlin	CPU, memory, network, I/O
Litmus	K8s resources
kube-monkey	K8s pod kill
AWS Fault Injection Simulator	AWS resources

Reliability vs Availability

Property	Definition	What It Measures
Reliability	Probability system works correctly over time	“Did we serve the right answer?”
Availability	Proportion of time system is operational	“Is the system up?”

Reliable but not available:  right answers, but only when the system is up
Available but not reliable:  always responds, but sometimes with wrong answers or errors
Both:                        right answers, every time they are asked for
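Both properties can be estimated from simple counters. This is a rough sketch: treating reliability as the fraction of correct responses is a simplification of the "probability over time" definition in the table, and the function names and figures are hypothetical.

```python
def availability(uptime_s, downtime_s):
    """Fraction of time the system was operational."""
    return uptime_s / (uptime_s + downtime_s)

def reliability(correct_responses, total_responses):
    """Fraction of served responses that were correct."""
    return correct_responses / total_responses

# Hypothetical 30-day month with 43 minutes of downtime: about 99.9% available.
month_s = 30 * 24 * 3600
down_s = 43 * 60
print(f"availability: {availability(month_s - down_s, down_s):.3%}")

# Hypothetical traffic: 1,000 responses served, 10 of them wrong.
print(f"reliability:  {reliability(990, 1000):.1%}")
```

The two numbers move independently: a service can stay up all month while returning wrong answers, and a service that is always correct can still be down for hours.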

Quick Checklist

□ All external calls have timeouts
□ Circuit breakers on all dependencies
□ Retries with exponential backoff + jitter
□ Bulkhead isolation (separate thread pools / deployments)
□ Graceful degradation for non-critical dependencies
□ Health endpoints: /health/live and /health/ready
□ Chaos engineering in staging (game days)
□ Runbook for every failure scenario
□ Observability: latency percentiles, error rates, circuit state
