Reliability

Reliability is the ability of a system to keep working correctly over time, even when individual components fail. It is the foundation of site reliability engineering (SRE).


What’s Here

  • availability — SLA, SLO, error budgets, availability tiers
  • resilience — Circuit breakers, retries with backoff, bulkheads, graceful degradation
  • load-balancing — LB algorithms, health checks, L4 vs L7
  • idempotency — Designing APIs that are safe to retry
  • memory-leaks — Detection, prevention, and impact on reliability
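As a rough illustration of the "retries with backoff" item in the resilience entry, here is a minimal sketch in Python. The function name retry_with_backoff and its parameters (max_attempts, base_delay, max_delay, sleep) are hypothetical, chosen for this example only. It assumes the wrapped operation is idempotent, since retrying a non-idempotent call can repeat its side effects.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is used up.

    Hypothetical helper for illustration; assumes operation is idempotent.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff capped at max_delay, with full jitter so
            # that many clients do not retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
```

The sleep parameter is injectable only so the delay can be replaced in tests. A real client would usually also restrict which exception types are worth retrying.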

Quick Reference

Failure Mode Analysis:
  Every component WILL fail.
  Question: what is the blast radius?

Reliability Patterns (in order of impact):
  1. Redundancy — N+1 instances, multi-AZ
  2. Health checks — detect failures fast
  3. Circuit breakers — stop cascading failures
  4. Graceful degradation — serve what you can
  5. Idempotency — make retries safe
  6. Observability — you can't fix what you can't see
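
To make the circuit breaker pattern from the list above concrete, here is a minimal single-threaded sketch. The class name CircuitBreaker, the CircuitOpenError exception, and the failure_threshold, reset_timeout, and clock parameters are all assumptions made for this example, not a reference to any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Illustrative breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still open: fail fast instead of hitting the dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let a trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()  # open (or re-open)
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open is what stops a struggling dependency from dragging its callers down with it. This sketch is not thread-safe, and in the half-open state it does not limit how many trial calls get through.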

Key Metrics

Metric          Meaning
Error Budget    Allowable unreliability per period: 100% minus the SLO target, spent down by actual downtime and errors
MTTR            Mean Time To Recovery: how fast you recover from a failure
MTTF            Mean Time To Failure: average operating time before a failure occurs
Availability    Uptime / (Uptime + Downtime), expressed as a percentage
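
The availability and error budget definitions above reduce to simple arithmetic. The sketch below works them through in Python. The function names and the 30-day, 99.9% figures are illustrative assumptions, and the budget is expressed as downtime minutes for simplicity (request-based budgets work the same way with request counts).

```python
def availability(uptime, downtime):
    """Fraction of time the system was up: Uptime / (Uptime + Downtime)."""
    return uptime / (uptime + downtime)


def error_budget_minutes(slo_target, period_minutes):
    """Downtime the SLO allows in the period, e.g. slo_target=0.999."""
    return (1.0 - slo_target) * period_minutes


def budget_remaining(slo_target, period_minutes, downtime_minutes):
    """Budget left after the downtime already incurred (may go negative)."""
    return error_budget_minutes(slo_target, period_minutes) - downtime_minutes


if __name__ == "__main__":
    period = 30 * 24 * 60  # a 30-day window, in minutes
    print(error_budget_minutes(0.999, period))      # budget for a 99.9% SLO
    print(budget_remaining(0.999, period, 10))      # after 10 minutes down
    print(availability(period - 10, 10) * 100)      # availability, percent
```

For a 99.9% SLO over 30 days (43,200 minutes), the budget comes to roughly 43.2 minutes of downtime. A negative remaining budget means the SLO has been missed for that period.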

Related

  • caching — Caching impacts reliability (cache failures cascade)
  • shift-left — Security testing improves reliability
  • scaling — Scaling for reliability