Disaster Recovery

Disaster recovery (DR) is the ability to restore full service after a major failure that takes out one or more critical components. Unlike standard reliability (single component failure), DR addresses catastrophic failures — entire regions, data corruption, ransomware, natural disasters.

RPO and RTO

Two numbers define your DR posture:

  • RPO (Recovery Point Objective) — maximum acceptable data loss measured in time. “RPO = 1 hour” means you must lose no more than 1 hour of data.
  • RTO (Recovery Time Objective) — maximum acceptable downtime. “RTO = 4 hours” means you must restore service within 4 hours.
RPO = how much data can we afford to lose?
RTO = how long can we afford to be down?

These are business decisions, not technical ones. Finance cares about RPO (transactions), Operations cares about RTO (service restoration).

CriticalityRPORTOExample
Mission-critical0 (synchronous replication)< 15 minFinancial trading, emergency services
Business-critical< 1 hour< 4 hoursE-commerce checkout, ERP
Business-operational< 24 hours< 24 hoursInternal tools, reporting
Low-critical< 1 week< 1 weekAnalytics, data pipelines

DR Strategies

Strategy 1: Backup and Restore

Periodically back up data, restore from backup after disaster.

Nightly backup → restore takes hours → RTO = hours to days
Weekly backup → restore takes days → RTO = days

Pros: Simple, low cost, works for any data store. Cons: High RPO (all data since last backup is lost), high RTO (restore takes time).

When to use: RTO > 24 hours acceptable, data not frequently changed, budget constrained.

Strategy 2: Pilot Light

A minimal version of your infrastructure is always running in the secondary region. Core data is replicated, but app servers are off until needed.

Primary: Full infrastructure (all services running)
Secondary: Pilot light (DB replica, minimal compute, storage)

Disaster:
1. Scale up secondary app servers (minutes)
  2. Update DNS to secondary (DNS TTL must be low)
  3. Restore from replicated data (already there)
  RTO: 15 minutes to 2 hours

Pros: Lower cost than full hot standby, reasonable RTO. Cons: Secondary isn’t production-tested until disaster strikes, data replication lag.

When to use: RTO in hours acceptable, budget constraints, regional disaster scenario.

Strategy 3: Warm Standby

Secondary region has scaled-down but functional infrastructure. Data is synchronously or near-synchronously replicated.

Primary: Full infrastructure
Secondary: Warm standby (10-20% capacity, scaled down)

Disaster:
  1. Scale up secondary to full capacity (auto-scale, 5-15 minutes)
  2. DNS failover (1-5 minutes with low TTL)
  RTO: 15-60 minutes

Pros: Faster failover than pilot light, regularly tested (at reduced scale). Cons: Higher cost (standby always running), still a gap between normal and DR capacity.

When to use: RTO in minutes to hours, some budget for standby infrastructure.

Strategy 4: Multi-Region Active-Active

Two or more regions serve traffic simultaneously. If one region fails, the other absorbs all traffic.

Region A (50% traffic) ←→ Data (synchronous replication) ←→ Region B (50% traffic)
Region C (geo-distributed reads) ←→ Data (async replication)

Pros: Near-zero RTO for regional failure, no “disaster mode” activation needed. Cons: Maximum complexity, maximum cost, cross-region writes are slow (latency penalty).

When to use: RTO < 15 minutes, revenue-critical systems, global user base with latency requirements.

Data Replication Patterns

Synchronous Replication

Write confirmed only when both primary and replica acknowledge.

Client → Primary → Replica (ack) → Client (write confirmed)
  • RPO = 0 for replica failure (no data loss)
  • Latency penalty = one-way network latency to replica
  • Availability penalty — write blocked if replica is down

Asynchronous Replication

Write confirmed on primary, replicated to replica in background.

Client → Primary (ack immediately) → Replica (async, background)
  • Latency penalty = minimal (only primary ack required)
  • RPO > 0 — some data loss if primary fails before replication
  • Replica can lag — network issues cause replication backlog

Change Data Capture (CDC)

Capture database changes as events and stream to secondary:

Primary DB → CDC (Debezium, AWS DMS) → Message Queue → Secondary DB
  • Near-real-time — typically seconds of lag
  • Schema evolution handling — complex but powerful
  • Works across heterogeneous stores — migrate from MySQL to Postgres

DR Topology

Same-Region, Different AZs

Cheapest multi-AZ setup. Protects against single AZ failure, not against region-level disasters.

us-east-1a (primary) ←→ us-east-1b (secondary AZ)
  ↑ RDS Multi-AZ (synchronous)

Cross-Region

Protects against regional disasters. Choose regions with geographic separation:

us-east-1 (primary) ←→ eu-west-1 (secondary)
us-west-2 (primary) ←→ ap-southeast-1 (secondary)

Considerations:

  • Data residency laws (data can’t leave certain jurisdictions)
  • Cross-region latency (all writes have added latency)
  • Cost differences between regions

Failover Procedures

DR is only as good as the failover procedure. Document and test it.

DNS Failover

Normal: api.example.com → ALIAS → CloudFront → LB → us-east-1
Disaster: Change ALIAS → us-west-2

Requirements:

  • Low TTL on DNS records (60 seconds or less)
  • Health checks before failover (don’t fail over to a also-failing region)
  • Automated DNS update (Route53 health check + failover routing)

Database Failover

Primary (us-east-1) → Replica (us-west-2)
Disaster:
  1. Promote replica to primary ( RDS promote-read-replica)
  2. Update connection strings (or use a proxy/endpoint)
  3. Verify data integrity
  4. DNS failover to new primary

Application Failover

Disaster:
  1. Verify secondary region is healthy (health checks)
  2. Scale secondary app tier to full capacity
  3. Update DNS / load balancer
  4. Clear cached state (Redis in secondary region)
  5. Verify end-to-end functionality

Testing DR

A DR plan that hasn’t been tested is not a DR plan.

Types of DR Testing

Test TypeWhat it exercisesFrequency
Tabletop exerciseWalk through procedure, identify gapsQuarterly
Partial failoverFail over one component, verify recoveryMonthly
Full failoverComplete DR drill, measure actual RTOAnnually
Chaos injectionDeliberately destroy components, verify recoveryMonthly

Measuring Actual RTO

Run a real failover and measure:

t_discovery = time failure is detected
t_notification = time on-call is paged
t_diagnosis = time to understand root cause
t_decision = time to decide on DR action
t_execution = time to execute DR procedure
t_verification = time to verify service restored

RTO = t_discovery + t_notification + t_diagnosis + t_decision + t_execution + t_verification

Most companies discover their RTO is 2-10x their planned RTO when they first test.

Common DR Mistakes

  • DR plan not documented — procedure lives in one engineer’s head
  • DNS TTL too high — failover takes hours because records cached everywhere
  • No regular DR testing — plan is untested until disaster strikes
  • Replication lag not monitored — replica is days behind and nobody noticed
  • Secrets not replicated — credentials needed for failover aren’t in secondary region
  • Data in the wrong region — compliance data residency violated during failover
  • Single vendor dependency — cloud provider outage takes out both primary and DR