1.3 Design Reliable and Resilient Architectures
Disaster Recovery (DR)
Concepts
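- RTO (Recovery Time Objective): How long can you be down? The maximum acceptable time from the failure until service is restored.
- RPO (Recovery Point Objective): How much data can you lose? The maximum acceptable gap between the last recoverable copy (backup or replica) and the failure.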
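A small worked example may help keep the two apart. The sketch below uses made-up timestamps and targets (all values are illustrative assumptions, not from any real incident): the data-loss window is measured backwards from the failure to the last backup, and the outage is measured forwards from the failure to the restore.

```python
from datetime import datetime, timedelta

def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Everything written after the last backup is lost (compare against RPO)."""
    return failure - last_backup

def outage_duration(failure: datetime, restored: datetime) -> timedelta:
    """Time the service was unavailable (compare against RTO)."""
    return restored - failure

# Hypothetical incident timeline.
last_backup = datetime(2024, 1, 1, 2, 0)
failure = datetime(2024, 1, 1, 11, 30)
restored = datetime(2024, 1, 1, 13, 0)

# Hypothetical business targets.
rpo_target = timedelta(hours=4)
rto_target = timedelta(hours=2)

loss = data_loss_window(last_backup, failure)
outage = outage_duration(failure, restored)

meets_rpo = loss <= rpo_target    # 9h30m of lost data exceeds the 4h target
meets_rto = outage <= rto_target  # 1h30m of downtime is within the 2h target
```

In this scenario the recovery was fast enough, but the nightly backup was too infrequent: tightening RPO means backing up or replicating more often, not restoring faster.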
DR Strategies (Cheatsheet)
| Strategy | Description | Cost | RTO/RPO |
|---|---|---|---|
| Backup and Restore | Restore from S3/Tape to fresh infra | Lowest | Hours/Days |
| Pilot Light | Core services (DB) running and replicating; app servers off, launched from AMIs on failover | Low | Minutes/Hours |
| Warm Standby | Scaled-down version always running, scale up on failover | Medium | Minutes |
| Multi-Site Active/Active | Full capacity running in both regions | Highest | Near Zero |
WARNING
Exam Gotcha: Pilot Light means the database is running (replicating), but the app servers are off (just AMIs). Warm Standby means the app servers are running but at minimum capacity.
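The cost versus recovery-time trade-off in the table can be read as "pick the cheapest strategy that still meets the RTO". The sketch below encodes that rule. The time bounds in `STRATEGIES` are rough, hypothetical stand-ins for the "Hours/Days", "Minutes/Hours", "Minutes", and "Near Zero" bands above, not AWS-published figures, and `cheapest_strategy` is an illustrative helper name.

```python
from datetime import timedelta

# Ordered cheapest first. The recovery times are illustrative assumptions
# chosen only to mirror the bands in the cheatsheet table.
STRATEGIES = [
    ("Backup and Restore", timedelta(days=1)),
    ("Pilot Light", timedelta(hours=1)),
    ("Warm Standby", timedelta(minutes=10)),
    ("Multi-Site Active/Active", timedelta(0)),
]

def cheapest_strategy(rto_target: timedelta) -> str:
    """Return the lowest-cost strategy whose assumed recovery time fits the target."""
    for name, assumed_recovery_time in STRATEGIES:
        if assumed_recovery_time <= rto_target:
            return name
    # Only reachable for a negative target; fall back to the strongest option.
    return STRATEGIES[-1][0]

choice_relaxed = cheapest_strategy(timedelta(days=2))
choice_hours = cheapest_strategy(timedelta(hours=8))
choice_minutes = cheapest_strategy(timedelta(minutes=15))
choice_strict = cheapest_strategy(timedelta(seconds=30))
```

Exam questions usually follow the same pattern: find the stated RTO/RPO, discard the strategies that cannot meet it, and choose the least expensive one that remains.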
AWS Backup
- Function: Centralized, policy-based backup service for resources such as EBS, RDS, DynamoDB, EFS, FSx, and S3.
- Cross-Region Copy: Automatically copy backups to a DR region for compliance/DR.
- Vault Lock: WORM (Write Once Read Many) compliance protection; backups in a locked vault cannot be deleted or altered before their retention period ends.
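As a rough illustration of Cross-Region Copy, the sketch below builds a backup plan document with a `CopyActions` entry and passes it to the boto3 `create_backup_plan` call. The plan name, vault name, vault ARN, account ID, schedule, and retention period are all placeholder assumptions; `build_backup_plan` and `create_plan` are hypothetical helper names. boto3 is imported lazily so the plan can be inspected without AWS credentials.

```python
def build_backup_plan(plan_name: str, source_vault: str,
                      dr_vault_arn: str, retention_days: int = 35) -> dict:
    """Build a BackupPlan document with one daily rule that copies to a DR vault."""
    return {
        "BackupPlanName": plan_name,
        "Rules": [
            {
                "RuleName": "daily-with-dr-copy",
                "TargetBackupVaultName": source_vault,
                # Daily at 05:00 UTC (AWS cron syntax).
                "ScheduleExpression": "cron(0 5 ? * * *)",
                "Lifecycle": {"DeleteAfterDays": retention_days},
                # Cross-Region Copy: each recovery point is also copied
                # to a vault in the DR region.
                "CopyActions": [
                    {
                        "DestinationBackupVaultArn": dr_vault_arn,
                        "Lifecycle": {"DeleteAfterDays": retention_days},
                    }
                ],
            }
        ],
    }

def create_plan(plan: dict) -> dict:
    """Submit the plan to AWS Backup. Requires boto3 and valid credentials."""
    import boto3  # imported here so building the plan needs no AWS setup
    return boto3.client("backup").create_backup_plan(BackupPlan=plan)

# Placeholder values for illustration only.
plan = build_backup_plan(
    plan_name="prod-daily",
    source_vault="prod-vault",
    dr_vault_arn="arn:aws:backup:us-west-2:123456789012:backup-vault:dr-vault",
)
```

Calling `create_plan(plan)` would create the plan; resources still have to be assigned to it separately (for example by tag) before any backups run.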
AWS Elastic Disaster Recovery (DRS)
- Mechanism: Continuous block-level replication of on-prem/cloud servers to a low-cost staging area in AWS.
- Failover: Launches full-size EC2 instances only during a drill or disaster.
Availability Patterns
- Scale-out (Horizontal): Adding more instances (Auto Scaling). Preferred for stateless apps.
- Scale-up (Vertical): Increasing instance size (t3.medium → t3.large). Requires downtime: the instance must be stopped and started to change its type (a reboot alone is not enough).
- Auto-recovery: EC2 Auto Recovery moves an instance to healthy hardware if system status checks fail.
WARNING
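Exam Gotcha: EC2 Auto Recovery retains the Instance ID, Private IP, and Elastic IP, but in-memory data is lost. It triggers only on failed System Status Checks (problems with the underlying host hardware), not on failed Instance Status Checks (OS or configuration problems inside the instance).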
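One common way to configure this is a CloudWatch alarm on the `StatusCheckFailed_System` metric with the EC2 recover action. The sketch below builds the alarm parameters and passes them to the boto3 `put_metric_alarm` call. The instance ID, region, alarm name, and evaluation settings are placeholder assumptions; `recovery_alarm_params` and `create_alarm` are hypothetical helper names.

```python
def recovery_alarm_params(instance_id: str, region: str) -> dict:
    """Parameters for an alarm that recovers the instance on system check failure."""
    return {
        "AlarmName": f"auto-recover-{instance_id}",
        "Namespace": "AWS/EC2",
        # System status check only: instance status check failures
        # (OS-level problems) are not a trigger for recovery.
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,               # seconds per evaluation period
        "EvaluationPeriods": 2,     # fail two periods in a row before acting
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }

def create_alarm(params: dict, region: str) -> None:
    """Create the alarm. Requires boto3 and valid credentials."""
    import boto3  # imported here so building the parameters needs no AWS setup
    boto3.client("cloudwatch", region_name=region).put_metric_alarm(**params)

# Placeholder instance ID and region for illustration only.
params = recovery_alarm_params("i-0123456789abcdef0", "us-east-1")
```

Swapping the metric to `StatusCheckFailed_Instance` with a reboot action would be the analogous pattern for OS-level failures, which Auto Recovery itself does not handle.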