1.3 Design Reliable and Resilient Architectures

Disaster Recovery (DR)

Concepts

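  • RTO (Recovery Time Objective): How long can you be down? The maximum acceptable time from failure to restored service.
  • RPO (Recovery Point Objective): How much data can you lose? The maximum acceptable window of lost data, measured as time since the last backup or replicated recovery point.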
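
As a rough illustration of the two definitions above, the following sketch computes the achieved RTO and RPO for one incident and compares them with targets. The timeline and the 2-hour/4-hour targets are hypothetical values chosen only for the example.

```python
from datetime import datetime, timedelta


def achieved_rto(outage_start, service_restored):
    """Downtime actually experienced: from failure until service is back."""
    return service_restored - outage_start


def achieved_rpo(last_recovery_point, outage_start):
    """Data-loss window: anything written after the last recovery point is lost."""
    return outage_start - last_recovery_point


# Hypothetical incident timeline (assumed values, not from any real system).
last_backup = datetime(2024, 1, 10, 2, 0)
outage = datetime(2024, 1, 10, 9, 30)
restored = datetime(2024, 1, 10, 11, 0)

rto = achieved_rto(outage, restored)
rpo = achieved_rpo(last_backup, outage)

# Hypothetical business targets: RTO of 2 hours, RPO of 4 hours.
rto_met = rto <= timedelta(hours=2)
rpo_met = rpo <= timedelta(hours=4)

print(f"RTO achieved: {rto}, target met: {rto_met}")
print(f"RPO achieved: {rpo}, target met: {rpo_met}")
```

Here the recovery itself was fast enough, but a nightly backup cannot satisfy a 4-hour RPO: the backup frequency, not the restore speed, bounds the data loss.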

DR Strategies (Cheatsheet)

Strategy                 | Description                                              | Cost    | RTO/RPO
-------------------------|----------------------------------------------------------|---------|--------------
Backup and Restore       | Restore from S3/Tape to fresh infra                      | Lowest  | Hours/Days
Pilot Light              | Core services (DB) running, App servers off (AMI)        | Low     | Minutes/Hours
Warm Standby             | Scaled-down version always running, scale up on failover | Medium  | Minutes
Multi-Site Active/Active | Full capacity running in both regions                    | Highest | Near Zero
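
One way to read the strategy table is "pick the cheapest strategy that still meets the required RTO". The sketch below encodes that reasoning. The concrete time bounds (24 hours, 1 hour, 10 minutes, zero) are illustrative assumptions standing in for "Hours/Days", "Minutes/Hours", "Minutes", and "Near Zero"; they are not AWS-published figures.

```python
from datetime import timedelta

# Ordered cheapest first. The time bounds are assumed, illustrative values.
STRATEGIES = [
    ("Backup and Restore", timedelta(hours=24)),
    ("Pilot Light", timedelta(hours=1)),
    ("Warm Standby", timedelta(minutes=10)),
    ("Multi-Site Active/Active", timedelta(0)),
]


def cheapest_strategy(required_rto):
    """Return the first (cheapest) strategy whose assumed RTO fits the requirement."""
    for name, typical_rto in STRATEGIES:
        if typical_rto <= required_rto:
            return name
    # Unreachable for non-negative requirements, kept as a safe default.
    return STRATEGIES[-1][0]


print(cheapest_strategy(timedelta(days=2)))
print(cheapest_strategy(timedelta(hours=4)))
print(cheapest_strategy(timedelta(minutes=30)))
print(cheapest_strategy(timedelta(seconds=30)))
```

A real decision would also weigh RPO, budget, and operational complexity; the point is only that cost rises as the tolerated downtime shrinks.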

WARNING

Exam Gotcha: Pilot Light means the database is running (replicating), but the app servers are off (just AMIs). Warm Standby means the app servers are running but at minimum capacity.

AWS Backup

  • Function: Centralized backup service for EBS, RDS, DynamoDB, EFS, FSx, and S3.
  • Cross-Region Copy: Automatically copy backups to a DR region for compliance/DR.
  • Vault Lock: WORM (Write Once Read Many) compliance protection for backups.
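
A minimal sketch of a backup plan with a cross-region copy, assuming boto3 is installed and credentials are configured. The plan name, vault name, vault ARN, schedule, and 35-day retention are hypothetical placeholders. The plan document is built as a plain dictionary so it can be inspected without calling AWS.

```python
def build_backup_plan(plan_name, source_vault, dr_vault_arn, retention_days=35):
    """Build a BackupPlan document with one daily rule that also copies to a DR vault."""
    lifecycle = {"DeleteAfterDays": retention_days}
    return {
        "BackupPlanName": plan_name,
        "Rules": [
            {
                "RuleName": "daily-with-dr-copy",  # hypothetical rule name
                "TargetBackupVaultName": source_vault,
                "ScheduleExpression": "cron(0 5 ? * * *)",  # daily at 05:00 UTC
                "Lifecycle": dict(lifecycle),
                # Cross-Region Copy: each recovery point is copied to the DR vault.
                "CopyActions": [
                    {
                        "DestinationBackupVaultArn": dr_vault_arn,
                        "Lifecycle": dict(lifecycle),
                    }
                ],
            }
        ],
    }


def create_plan(plan):
    """Submit the plan to AWS Backup (needs boto3 and valid credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    return boto3.client("backup").create_backup_plan(BackupPlan=plan)


plan = build_backup_plan(
    "prod-daily",  # hypothetical plan name
    "prod-vault",  # hypothetical vault in the primary region
    "arn:aws:backup:us-west-2:111122223333:backup-vault:dr-vault",  # placeholder ARN
)
print(plan["Rules"][0]["CopyActions"][0]["DestinationBackupVaultArn"])
```

The plan alone backs up nothing; resources still have to be assigned to it, for example with a separate create_backup_selection call.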

AWS Elastic Disaster Recovery (DRS)

  • Mechanism: Continuous block-level replication of on-prem/cloud servers to a low-cost staging area in AWS.
  • Failover: Launches full-size EC2 instances only during a drill or disaster.

Availability Patterns

  • Scale-out (Horizontal): Adding more instances (Auto Scaling). Preferred for stateless apps.
  • Scale-up (Vertical): Increasing instance size (t3.medium → t3.large). Requires downtime, because the instance must be stopped before its type can be changed and then started again.
  • Auto-recovery: EC2 Auto Recovery moves an instance to healthy hardware if system status checks fail.
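
To show why vertical scaling implies downtime, here is a sketch of resizing an EBS-backed instance with boto3: stop, change the type, start. The function takes the EC2 client as a parameter; the instance ID in the usage comment is a placeholder.

```python
def resize_instance(ec2, instance_id, new_type):
    """Vertically scale one instance. The app is offline between stop and start."""
    # 1. The instance type can only be changed while the instance is stopped.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # 2. Change the instance type (for example t3.medium to t3.large).
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": new_type},
    )

    # 3. Start it again and wait until it is running.
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])


# Usage (requires boto3 and credentials; the instance ID is a placeholder):
#   import boto3
#   resize_instance(boto3.client("ec2"), "i-0123456789abcdef0", "t3.large")
```

Horizontal scaling avoids this outage window because new instances are added while the existing ones keep serving traffic.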

WARNING

Exam Gotcha: EC2 Auto Recovery retains the Instance ID, Private IP, and Elastic IP. It is triggered only by failed System Status Checks (hardware errors), not by failed Instance Status Checks (OS errors).
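
One common way to set up this kind of recovery is a CloudWatch alarm on the StatusCheckFailed_System metric with the EC2 recover action. The sketch below builds the alarm parameters; the alarm name, period, and evaluation count are assumed values, and the ARN assumes the standard "aws" partition.

```python
def recovery_alarm_params(instance_id, region):
    """Build put_metric_alarm arguments that recover an instance on system check failure."""
    return {
        "AlarmName": f"auto-recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        # System check only; StatusCheckFailed_Instance (OS errors) is not watched.
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,  # assumed: evaluate every minute
        "EvaluationPeriods": 2,  # assumed: two failing periods in a row
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


params = recovery_alarm_params("i-0123456789abcdef0", "us-east-1")
print(params["MetricName"], params["AlarmActions"][0])

# To create the alarm (requires boto3 and credentials):
#   import boto3
#   boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)
```

Because the alarm watches only the system check metric, an OS-level failure would not trigger it, which matches the gotcha above.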