Availability

Availability is the proportion of time a system is operational and accessible. It is typically the first reliability metric stakeholders ask about.


Key Terms

| Term | Meaning | Example |
|---|---|---|
| SLA (Service Level Agreement) | Contractual commitment to customers | "99.9% uptime, or we pay credits" |
| SLO (Service Level Objective) | Internal target you aim for | "Target 99.95% uptime" |
| SLI (Service Level Indicator) | What you actually measure | Real p99 latency, real error rate |
| Error Budget | Allowable downtime per period | 43.8 min/month at 99.9% |

Availability Tiers

| Target | Downtime/Year | Downtime/Month | Downtime/Week |
|---|---|---|---|
| 90% | 36.5 days | 3 days | 16.8 hours |
| 99% | 3.65 days | 7.3 hours | 1.7 hours |
| 99.9% | 8.76 hours | 43.8 min | 10.1 min |
| 99.95% | 4.38 hours | 21.9 min | 5.0 min |
| 99.99% | 52.6 min | 4.4 min | 1.0 min |
| 99.999% | 5.26 min | 26.3 sec | 6.1 sec |

Rule: Each additional 9 costs roughly 10x in complexity and infrastructure cost.
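The downtime figures in the tiers table follow directly from the target percentage. Here is a minimal sketch of that arithmetic, assuming a 365-day year and a month taken as one twelfth of a year; the function name `allowed_downtime_seconds` and the `PERIODS` mapping are illustrative, not from any library.

```python
# Period lengths in seconds. A month is taken as 1/12 of a 365-day year,
# an assumption that reproduces the figures in the tiers table.
SECONDS_PER_YEAR = 365 * 24 * 3600
PERIODS = {
    "year": SECONDS_PER_YEAR,
    "month": SECONDS_PER_YEAR / 12,
    "week": 7 * 24 * 3600,
}


def allowed_downtime_seconds(target_percent: float, period: str) -> float:
    """Return the downtime (in seconds) permitted by an availability target."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    return (1 - target_percent / 100) * PERIODS[period]


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_seconds(target, "month") / 60
        print(f"{target}% -> {minutes:.1f} min/month")
```

Going from 99.9% to 99.99% shrinks the monthly allowance from roughly 44 minutes to roughly 4, which is why each extra 9 tends to demand automated failover rather than human response.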


Measuring Availability

SLI Patterns

# Availability (%) = successful requests / total requests * 100
availability = successful_requests / total_requests * 100

# For a system with an SLO of 99.9%:
# Allowable error budget: 0.1% of requests per window
# If you handle 1M req/day, error budget = 1,000 failed req/day
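The request-based SLI and its error budget can be wrapped in two small helpers. This is a sketch under stated assumptions: the function names are hypothetical, a window with no traffic is treated as fully available, and the failure allowance is rounded to the nearest whole request to absorb floating-point error.

```python
def availability_percent(successful_requests: int, total_requests: int) -> float:
    """Request-based availability SLI, as a percentage."""
    if total_requests == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 100.0
    return successful_requests / total_requests * 100


def allowed_failures(slo_percent: float, total_requests: int) -> int:
    """Number of requests that may fail in the window without breaching the SLO."""
    # round() absorbs floating-point error such as 999.9999999 instead of 1000.
    return round((100 - slo_percent) / 100 * total_requests)


if __name__ == "__main__":
    print(allowed_failures(99.9, 1_000_000))
    print(availability_percent(999_000, 1_000_000))
```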

Common SLIs

| Service Type | Good SLI | Bad SLI |
|---|---|---|
| User-facing API | Request success rate | |
| Read-heavy data | Cache hit ratio | |
| Write-heavy data | Commit success rate | |
| Background jobs | Job completion rate | |
| Data pipeline | Records processed / expected | |

Error Budgets

The error budget is the allowable amount of unreliability before you freeze features and focus on stability.

SLO: 99.9% (43.8 min/month downtime budget)
Actual: 99.95% (21.9 min/month downtime) ← budget healthy, ship features

Actual: 99.85% (65.7 min/month downtime) ← budget burning, halt feature work
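The two cases above can be expressed as a fraction of budget remaining. Below is a minimal sketch, assuming availability is measured over the same window as the SLO; `error_budget_remaining` is an illustrative name, and a negative result means the SLO has been breached.

```python
def error_budget_remaining(slo_percent: float, actual_percent: float) -> float:
    """Fraction of the error budget left; negative means the SLO is breached."""
    budget = 100 - slo_percent  # allowed unavailability, in percentage points
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    spent = 100 - actual_percent  # observed unavailability
    return (budget - spent) / budget


if __name__ == "__main__":
    # 99.9% SLO with 99.95% actual: about half the budget is left.
    print(f"{error_budget_remaining(99.9, 99.95):.2f}")
    # 99.9% SLO with 99.85% actual: the budget is overspent by about half.
    print(f"{error_budget_remaining(99.9, 99.85):.2f}")
```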

Budget Policy

# Alert when the error budget is running low
error_budget_remaining < 50%:
  severity: warning
  action: investigate reliability incidents
 
error_budget_remaining < 10%:
  severity: critical
  action: feature freeze, all hands on reliability
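One way to evaluate this policy in code is a threshold lookup. The sketch below mirrors the two thresholds and actions listed above; the `POLICY` table and `budget_alert` function are hypothetical names, and the input is assumed to be the remaining budget as a fraction between 0 and 1 (or negative once the budget is overspent).

```python
from typing import Optional, Tuple

# (remaining-fraction threshold, severity, action), ordered from most to
# least severe so that the first matching rule wins.
POLICY = [
    (0.10, "critical", "feature freeze, all hands on reliability"),
    (0.50, "warning", "investigate reliability incidents"),
]


def budget_alert(remaining_fraction: float) -> Optional[Tuple[str, str]]:
    """Return (severity, action) for the remaining budget, or None if healthy."""
    for threshold, severity, action in POLICY:
        if remaining_fraction < threshold:
            return severity, action
    return None


if __name__ == "__main__":
    for remaining in (0.8, 0.3, 0.05):
        print(remaining, budget_alert(remaining))
```

Checking the critical rule first matters: a remaining budget of 5% satisfies both conditions, and only the stricter one should fire.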

Designing for Availability

Redundancy Patterns

Single component:      Redundant:
┌──────────┐           ┌────┐  ┌────┐
│    DB    │           │ DB │  │ DB │ ← primary + replica
└──────────┘           └────┘  └────┘
  ↓ failure            ↓ replica handles failover

Multi-AZ deployment:
┌──────────┐  ┌──────────┐  ┌──────────┐
│   AZ-1   │  │   AZ-2   │  │   AZ-3   │
│  ┌────┐  │  │  ┌────┐  │  │  ┌────┐  │
│  │app │  │  │  │app │  │  │  │app │  │
│  └────┘  │  │  └────┘  │  │  └────┘  │
│  ┌────┐  │  │  ┌────┐  │  │  ┌────┐  │
│  │ DB │  │  │  │ DB │  │  │  │ DB │  │
│  └────┘  │  │  └────┘  │  │  └────┘  │
└──────────┘  └──────────┘  └──────────┘
 ↑ AZ failure = handled by other AZs
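The benefit of redundancy can be estimated with a standard back-of-the-envelope formula: if any one of n copies is enough to serve traffic, the combined availability is 1 - (1 - a)^n. The sketch below assumes failures are fully independent and failover is instantaneous, which real deployments only approximate, so treat the result as an upper bound. The function name is illustrative.

```python
def combined_availability(component_availability: float, replicas: int) -> float:
    """Availability of `replicas` independent copies where any one suffices.

    Assumes independent failures and instant failover (an idealization).
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if not 0 <= component_availability <= 1:
        raise ValueError("component_availability must be a fraction in [0, 1]")
    # All replicas must be down at once for the service to be down.
    return 1 - (1 - component_availability) ** replicas


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, f"{combined_availability(0.99, n):.6f}")
```

Under these assumptions, two 99% components combine to about 99.99%. Correlated failures such as a shared bad deployment or a regional outage reduce that figure in practice.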

High Availability Checklist

□ Active-active across 2+ AZs (not active-passive)
□ Database with synchronous replication (or equivalent)
□ Load balancer health checks with automatic removal
□ Graceful degradation (circuit breakers)
□ Health endpoints for orchestration (K8s readiness/liveness probes)
□ Regular chaos testing (game days)
□ Runbook for every failure scenario
□ Observability: SLO dashboard + error budget alerts
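To make the graceful-degradation item concrete, here is a minimal circuit breaker sketch. It is an illustration rather than a production implementation: the class, its parameters (`failure_threshold`, `reset_timeout`, `clock`), and the single-threaded design are all assumptions, and real services would usually rely on an established resilience library instead.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures, then retries once a cooldown elapses."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the cooldown can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        # While open and still cooling down, skip the dependency entirely.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()
            # Cooldown elapsed: fall through and allow one trial call (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=2, reset_timeout=5.0)

    def flaky():
        raise RuntimeError("dependency down")

    for _ in range(3):
        print(breaker.call(flaky, lambda: "cached response"))
```

The fallback is what makes the degradation graceful: callers receive a cached or reduced response instead of waiting on a dependency that is already known to be failing.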

Common Causes of Downtime

CauseMitigation
Database overloadRead replicas, connection pooling, query limits
Cascading failuresCircuit breakers, bulkheads, rate limiting
Deployment failuresBlue-green, canary, rollback automation
Dependency outageGraceful degradation, fallback behavior
Traffic spikesAuto-scaling, rate limiting, CDN
Configuration errorsConfig-as-code, staged rollout, validation
Resource exhaustionAuto-scaling, resource limits (K8s)
