Reliability

Reliability is the ability of a system to keep working correctly over time, even when individual components fail. It is the foundation of site reliability engineering (SRE).

What’s Here

availability — SLA, SLO, error budgets, availability tiers
resilience — Circuit breakers, retries with backoff, bulkheads, graceful degradation
load-balancing — LB algorithms, health checks, L4 vs L7
idempotency — Designing APIs that are safe to retry
memory-leaks — Detection, prevention, and impact on reliability

Quick Reference

Failure Mode Analysis:
  Every component WILL fail.
  Question: what is the blast radius?

Reliability Patterns (in order of impact):
  1. Redundancy — N+1 instances, multi-AZ
  2. Health checks — detect failures fast
  3. Circuit breakers — stop cascading failures
  4. Graceful degradation — serve what you can
  5. Idempotency — make retries safe
  6. Observability — you can't fix what you can't see
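
To make patterns 3 and 4 concrete, here is a minimal sketch of a circuit breaker paired with a fallback value. The names `CircuitBreaker`, `CircuitOpenError`, `fetch_with_fallback`, and the parameters `max_failures` and `reset_after` are illustrative assumptions, not the API of any particular library. Production breakers usually add thread safety and per-error-type policies.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    # Hypothetical defaults: open after 3 consecutive failures, retry after 30 s.
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling load onto a failing dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def fetch_with_fallback(breaker, fetch, fallback):
    """Graceful degradation: serve a fallback when the dependency is unavailable."""
    try:
        return breaker.call(fetch)
    except Exception:
        return fallback
```

A caller might wrap a remote lookup as `fetch_with_fallback(breaker, load_profile, cached_profile)`, where `load_profile` and `cached_profile` are placeholders for whatever the service can still serve when the dependency is down.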

Key Metrics

Metric	Meaning
Error Budget	Allowable unreliability per period: 100% minus the SLO target, tracked against actual downtime
MTTR	Mean Time To Recovery — how fast you recover
MTTF	Mean Time To Failure — average operating time before a failure occurs
Availability	Uptime / (Uptime + Downtime) as a percentage
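
The arithmetic behind these metrics can be sketched in a few lines. The 30-day window, the 99.9% SLO, and the 12 minutes of observed downtime are assumed example values, and the function names are illustrative. The MTTF/MTTR form is the usual steady-state approximation for a repairable system.

```python
def availability(uptime, downtime):
    """Availability as a percentage: uptime / (uptime + downtime)."""
    return 100.0 * uptime / (uptime + downtime)


def availability_from_mttf(mttf, mttr):
    """Steady-state availability (percent) from mean time to failure and recovery."""
    return 100.0 * mttf / (mttf + mttr)


def error_budget_minutes(slo_percent, period_minutes):
    """Downtime the SLO allows over the period, in minutes."""
    return (1 - slo_percent / 100.0) * period_minutes


PERIOD = 30 * 24 * 60  # assumed 30-day window, in minutes
budget = error_budget_minutes(99.9, PERIOD)  # about 43.2 minutes
downtime = 12.0  # hypothetical observed downtime, in minutes
remaining = budget - downtime  # budget still available this period

print(f"availability: {availability(PERIOD - downtime, downtime):.3f}%")
print(f"budget: {budget:.1f} min, remaining: {remaining:.1f} min")
```

When `remaining` approaches zero, a team following an error-budget policy would typically slow down risky releases until the window resets.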

Related

caching — Caching impacts reliability (cache failures cascade)
shift-left — Security testing improves reliability
scaling — Scaling for reliability
