🔄 High Availability & Disaster Recovery#

Learning Objectives#

  • Design for high availability using Multi-AZ and Multi-Region strategies
  • Implement disaster recovery patterns (backup, pilot light, warm standby, multi-site)
  • Understand RTO and RPO and how to achieve them with AWS services

1. HA & DR Fundamentals#

1.1 Key Metrics#

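| Metric | Description | Example |
|---|---|---|
| RTO (Recovery Time Objective) | Maximum acceptable downtime | 1 hour |
| RPO (Recovery Point Objective) | Maximum acceptable data loss, measured as a time window | 15 minutes |
| MTBF (Mean Time Between Failures) | Average operating time between failures | A higher MTBF (with a low MTTR) supports targets such as 99.99% availability |
| MTTR (Mean Time To Recovery) | Average time to restore service after a failure | Automated failover in seconds |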
graph LR
    subgraph Timeline["Disaster Recovery Timeline"]
        NOW["🟢 Normal Operation"]
        DISASTER["💥 Disaster Strikes\nTimestamp: T+0"]
        RESTORE["✅ System Restored\nTimestamp: T+RTO"]
    end

    DISASTER -->|RPO: Data loss window| LOSS["📉 Last Backup\nData lost since backup"]
    DISASTER -->|RTO: Recovery Time| RESTORE

    style DISASTER fill:#d33,color:#fff
    style RESTORE fill:#1e8900,color:#fff
    style LOSS fill:#888,color:#fff

    NOW -->|Time passes| DISASTER

⚡ Exam Tip: Lower RTO/RPO = higher cost. Design according to business requirements, not technical perfection.
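The metrics in the table above are linked by simple arithmetic: steady-state availability is MTBF / (MTBF + MTTR), and an availability target implies a yearly downtime budget. The sketch below shows that calculation. The function names and sample numbers are illustrative assumptions, not figures from any AWS SLA.

```shell
# Availability (%) from MTBF and MTTR, both given in the same unit (hours here).
availability_pct() {
  awk -v mtbf="$1" -v mttr="$2" 'BEGIN { printf "%.4f\n", 100 * mtbf / (mtbf + mttr) }'
}

# Allowed downtime in minutes per 365-day year for a given availability (%).
downtime_min_per_year() {
  awk -v pct="$1" 'BEGIN { printf "%.1f\n", (100 - pct) / 100 * 365 * 24 * 60 }'
}

availability_pct 1000 1        # one failure every 1000 h, 1 h to recover -> 99.9001
downtime_min_per_year 99.99    # downtime budget at "four nines" -> 52.6
```

Shortening MTTR (for example, through automated failover) is usually the cheaper lever, because it raises availability without requiring components to fail less often.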

1.2 DR Strategies Comparison#

graph TD
    BACKUP["💾 Backup & Restore\nRTO: Hours\nRPO: 24 hours\nCost: 💰\nUse: Dev/Test, non-critical"]

    PILOT["🔦 Pilot Light\nRTO: ~30 min\nRPO: ~15 min\nCost: 💰💰\nUse: Medium-critical apps"]

    WARM["🔥 Warm Standby\nRTO: ~5 min\nRPO: ~1 min\nCost: 💰💰💰\nUse: Production apps"]

    MULTI["🌍 Multi-Site Active-Active\nRTO: Near zero\nRPO: Near zero\nCost: 💰💰💰💰💰\nUse: Global zero-downtime"]

    BACKUP -->|Faster recovery + Higher cost| PILOT
    PILOT -->|Faster recovery + Higher cost| WARM
    WARM -->|Faster recovery + Higher cost| MULTI

    style BACKUP fill:#01ab5c,color:#fff
    style PILOT fill:#ff9900,color:#fff
    style WARM fill:#527fff,color:#fff
    style MULTI fill:#d33,color:#fff

Cost vs Recovery Trade-Off:

| Strategy | Cost | Complexity | RTO | RPO |
|---|---|---|---|---|
| Single AZ | Low | None | Hours | Hours |
| Multi-AZ | Medium | Low | Minutes | Minutes |
| Multi-Region (Active-Passive) | High | Medium | Minutes | Minutes |
| Multi-Region (Active-Active) | Highest | High | Seconds | Seconds |
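One way to apply the comparison above is to pick the cheapest strategy whose indicative RTO and RPO still meet the business targets. The sketch below uses the indicative figures from this section. The helper name `pick_dr_strategy` is hypothetical, and the 240-minute RTO for Backup & Restore is an assumed stand-in for "hours".

```shell
# Pick the cheapest DR strategy whose indicative RTO/RPO (in minutes) meets the targets.
# Thresholds follow the comparison above; 240 min for Backup & Restore is an assumption.
pick_dr_strategy() {
  rto_target=$1   # maximum acceptable downtime, minutes
  rpo_target=$2   # maximum acceptable data loss, minutes
  if [ "$rto_target" -ge 240 ] && [ "$rpo_target" -ge 1440 ]; then
    echo "Backup & Restore"
  elif [ "$rto_target" -ge 30 ] && [ "$rpo_target" -ge 15 ]; then
    echo "Pilot Light"
  elif [ "$rto_target" -ge 5 ] && [ "$rpo_target" -ge 1 ]; then
    echo "Warm Standby"
  else
    echo "Multi-Site Active-Active"
  fi
}

pick_dr_strategy 60 15    # prints: Pilot Light
pick_dr_strategy 10 5     # prints: Warm Standby
```

Checking the cheapest option first mirrors the exam tip: start from the business requirement and stop at the first strategy that satisfies it.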

2. Multi-AZ HA Patterns#

2.1 Compute HA#

                Route53 (Health Check)
                           │
             ┌─────────────┴─────────────┐
             │                           │
     ┌───────┴───────┐           ┌───────┴───────┐
     │      ALB      │           │      ALB      │
     │ (us-east-1a)  │           │ (us-east-1b)  │
     └───────┬───────┘           └───────┬───────┘
             │                           │
     ┌───────┴───────┐           ┌───────┴───────┐
     │    EC2 x 2    │           │    EC2 x 2    │
     │     (ASG)     │           │     (ASG)     │
     └───────┬───────┘           └───────┬───────┘
             │                           │
             └─────────────┬─────────────┘
                           │
                   ┌───────┴───────┐
                   │ RDS Multi-AZ  │
                   │   (Primary)   │
                   └───────────────┘

Key Services:

  • EC2: Auto Scaling group spread across multiple AZs (minimum 2 AZs)
  • ALB: cross-zone load balancing distributes traffic to healthy instances in every enabled AZ
  • RDS: Multi-AZ deployment with a standby in another AZ
  • ElastiCache: Redis replication group with a replica in another AZ
  • NAT Gateway: one per AZ, so losing an AZ does not cut off outbound traffic from the others
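The Auto Scaling item in the list above could be set up roughly as follows: one group spanning two subnets in different AZs, with load balancer health checks deciding when an instance is replaced. This is a sketch under stated assumptions. The group name, launch template, subnet IDs, and target group ARN are hypothetical, and the `run` wrapper only prints the command unless `DRY_RUN=0` is set.

```shell
# Print each command by default; set DRY_RUN=0 to execute against a real account.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical identifiers: replace with your own template, subnets, and target group.
TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app/0123456789abcdef"

create_multi_az_asg() {
  # Two subnets in different AZs; 4 instances matches "EC2 x 2" per AZ in the diagram.
  run aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name app-asg \
    --launch-template 'LaunchTemplateName=app-template,Version=$Latest' \
    --min-size 2 --max-size 6 --desired-capacity 4 \
    --vpc-zone-identifier "subnet-aaaa1111,subnet-bbbb2222" \
    --target-group-arns "$TARGET_GROUP_ARN" \
    --health-check-type ELB \
    --health-check-grace-period 300
}

create_multi_az_asg
```

Using `--health-check-type ELB` means an instance that fails the load balancer's health check is replaced, not only one that fails the EC2 status checks.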

2.2 Database HA#

| Service | HA Mechanism | Failover Time |
|---|---|---|
| RDS Multi-AZ | Standby replica, synchronous replication | ~60-120 seconds |
| Aurora | 6 copies across 3 AZs, auto-failover | ~30 seconds |
| DynamoDB | Multi-AZ by default, auto-healing | Instant |
| ElastiCache Redis | Replication group, auto-failover | ~10-30 seconds |
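For the RDS Multi-AZ row above, a minimal sketch of enabling the standby on an existing instance and then forcing a failover to measure the real switch-over time. The instance identifier `app-db` is hypothetical, and the `run` wrapper prints commands unless `DRY_RUN=0` is set.

```shell
# Print each command by default; set DRY_RUN=0 to execute against a real account.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Convert an existing instance to a Multi-AZ deployment.
enable_multi_az() {
  run aws rds modify-db-instance \
    --db-instance-identifier "$1" \
    --multi-az \
    --apply-immediately
}

# Reboot with failover so the standby takes over; useful for timing the switch.
test_failover() {
  run aws rds reboot-db-instance \
    --db-instance-identifier "$1" \
    --force-failover
}

enable_multi_az app-db
test_failover app-db
```

Running the failover test in a non-production window is a practical way to check whether the failover time in the table holds for your workload.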

3. Disaster Recovery Patterns#

3.1 Backup & Restore (RTO: Hours, RPO: 24h)#

Primary Region                            DR Region
┌───────────────┐                     ┌───────────────┐
│  Production   │   Daily snapshots   │   S3 Bucket   │
│  (us-east-1)  │────────────────────>│  (us-west-2)  │
│               │   S3 Cross-Region   │ ┌───────────┐ │
│  RDS, EBS,    │   Replication       │ │ Snapshots │ │
│  S3 buckets   │                     │ └───────────┘ │
└───────────────┘                     └───────┬───────┘
                                              │
                                    (Restore when needed)
                                              │
                                        ┌─────┴─────┐
                                        │  New Env  │
                                        │ (restore) │
                                        └───────────┘

Cost: Lowest (pay only for backups and storage)

Use Case: Non-critical apps, dev/test environments
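The snapshot flow in the diagram above could look roughly like this for RDS: copy a daily snapshot into the DR region, then restore a new instance from the copy when it is needed. The snapshot ARN and identifiers are hypothetical. An encrypted snapshot would additionally need a `--kms-key-id` valid in the destination region. The `run` wrapper prints commands unless `DRY_RUN=0` is set.

```shell
# Print each command by default; set DRY_RUN=0 to execute against a real account.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical source snapshot in the primary region.
SRC_SNAPSHOT_ARN="arn:aws:rds:us-east-1:123456789012:snapshot:app-db-daily"

# Copy the snapshot from us-east-1 into us-west-2.
copy_snapshot_to_dr() {
  run aws rds copy-db-snapshot \
    --source-db-snapshot-identifier "$SRC_SNAPSHOT_ARN" \
    --target-db-snapshot-identifier app-db-daily-dr \
    --source-region us-east-1 \
    --region us-west-2
}

# During recovery, build a new instance in the DR region from the copied snapshot.
restore_in_dr() {
  run aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier app-db-restored \
    --db-snapshot-identifier app-db-daily-dr \
    --region us-west-2
}

copy_snapshot_to_dr
restore_in_dr
```

The RPO here is bounded by how often the copy runs, and the RTO by how long the restore and the rest of the environment take to come up.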

3.2 Pilot Light (RTO: ~30 min, RPO: ~15 min)#

Primary Region                               DR Region
┌─────────────────┐                     ┌─────────────────┐
│  Production     │                     │  Pilot Light    │
│                 │   Replicate data    │                 │
│  EC2 (running)  │────────────────────>│  RDS (running)  │
│  RDS (active)   │                     │  EC2 (stopped)  │
│                 │                     │  AMI ready      │
│  Route53        │                     │  Route53 warm   │
└─────────────────┘                     └────────┬────────┘
                                                 │
                                        When disaster hits:
                                        1. Start EC2 instances
                                        2. Update Route53 DNS
                                        3. Scale up

Cost: Low-medium (smaller footprint in DR)

Use Case: Medium-critical apps that can tolerate ~30 min of downtime
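The three activation steps in the diagram above can be scripted ahead of time so that failover is a single call. The sketch below assumes the DR database is a cross-region read replica that must be promoted before it accepts writes (an extra step not listed in the diagram). All identifiers, the hosted zone ID, and the `dr-dns.json` change batch are hypothetical. The `run` wrapper prints commands unless `DRY_RUN=0` is set.

```shell
# Print each command by default; set DRY_RUN=0 to execute against a real account.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

DR_REGION=us-west-2
DR_INSTANCE_IDS="i-0123456789abcdef0 i-0fedcba9876543210"   # hypothetical stopped instances

activate_pilot_light() {
  # 1. Start the stopped EC2 instances (word splitting of the ID list is intended).
  run aws ec2 start-instances --instance-ids $DR_INSTANCE_IDS --region "$DR_REGION"
  # Assumed extra step: promote the replica so the DR database accepts writes.
  run aws rds promote-read-replica --db-instance-identifier app-db-replica --region "$DR_REGION"
  # 2. Point DNS at the DR endpoint using a change batch prepared in advance.
  run aws route53 change-resource-record-sets \
    --hosted-zone-id Z0HYPOTHETICAL \
    --change-batch file://dr-dns.json
  # 3. Scale the DR Auto Scaling group up to production size.
  run aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name app-asg-dr \
    --min-size 2 --desired-capacity 4 \
    --region "$DR_REGION"
}

activate_pilot_light
```

Keeping the runbook as a script that is rehearsed regularly is what makes a ~30 minute RTO realistic; manual steps under pressure tend to take longer.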

3.3 Warm Standby (RTO: ~5 min, RPO: ~1 min)#

Primary Region                               DR Region
┌───────────────┐                     ┌─────────────────────┐
│  Full Prod    │                     │  Scaled-down Prod   │
│               │   Continuous        │                     │
│  EC2 (100%)   │────────────────────>│  EC2 (25%)          │
│  RDS (write)  │   Replication       │  RDS (read-only)    │
│  ALB          │                     │  ALB (routing)      │
└───────────────┘                     └──────────┬──────────┘
                                                 │
                                        When disaster hits:
                                        1. Scale up EC2
                                        2. Promote RDS to write
                                        3. Shift Route53 traffic

Cost: Medium-high (running scaled-down environment)

Use Case: Production apps, business-critical
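For the "Scale up EC2" step above, the target size follows from the standby fraction: a fleet running at 25% of production must grow fourfold. A minimal sketch of that calculation and the matching Auto Scaling call follows. The function names and the group name `app-asg-dr` are hypothetical, and the `run` wrapper prints commands unless `DRY_RUN=0` is set.

```shell
# Print each command by default; set DRY_RUN=0 to execute against a real account.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Instances needed at full size, given the standby count and its share of production (%).
# Integer ceiling division, so the DR fleet is never smaller than production.
full_capacity() {
  standby=$1
  percent=$2
  echo $(( (standby * 100 + percent - 1) / percent ))
}

# Raise the DR Auto Scaling group from standby size to full production size.
scale_up_standby() {
  target=$(full_capacity "$1" "$2")
  run aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name app-asg-dr \
    --max-size "$target" \
    --desired-capacity "$target" \
    --region us-west-2
}

full_capacity 3 25       # prints: 12
scale_up_standby 3 25
```

Service quotas in the DR region should already allow the full-size fleet; a quota increase requested during an outage would add directly to the RTO.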

3.4 Multi-Site Active-Active (RTO: Near 0, RPO: Near 0)#

┌─────────────────────┐             ┌─────────────────────┐
│  Region A           │             │  Region B           │
│  (us-east-1)        │             │  (us-west-2)        │
│                     │             │                     │
│  Route53 (Latency)  │─────────────│  Route53 (Latency)  │
│                     │             │                     │
│  ALB → EC2 → RDS    │             │  ALB → EC2 → RDS    │
│  DynamoDB Global    │─────────────│  DynamoDB Global    │
│  Table              │             │  Table              │
└─────────────────────┘             └─────────────────────┘

Cost: Highest (full production in 2+ regions)

Use Case: Global apps, zero-downtime requirements
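The DynamoDB Global Table link in the diagram above can be created by adding a replica region to an existing table. The sketch below assumes a hypothetical table named `Orders` in us-east-1. Prerequisites such as capacity mode or stream settings depend on the table's configuration and global tables version, so check them against the DynamoDB documentation before running this. The `run` wrapper prints commands unless `DRY_RUN=0` is set.

```shell
# Print each command by default; set DRY_RUN=0 to execute against a real account.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Add a replica of an existing table in another region.
add_replica() {
  table=$1
  home_region=$2
  replica_region=$3
  run aws dynamodb update-table \
    --table-name "$table" \
    --region "$home_region" \
    --replica-updates "[{\"Create\": {\"RegionName\": \"$replica_region\"}}]"
}

add_replica Orders us-east-1 us-west-2
```

Once the replica is active, both regions accept writes, which is what allows latency-based routing to send users to either side.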


4. DR with AWS Services#

4.1 DNS Failover (Route53)#

# Primary health check
aws route53 create-health-check \
  --caller-reference "primary-app-$(date +%s)" \
  --health-check-config '{"Type": "HTTPS", "FullyQualifiedDomainName": "app.primary.com", "Port": 443, "RequestInterval": 10, "FailureThreshold": 3 }'

# Create failover record sets pointing to primary and secondary
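The failover record sets mentioned in the comment above could be sketched as a change batch with one PRIMARY and one SECONDARY record. Only the primary needs the health check, since Route53 answers with the secondary when the primary is unhealthy. The hosted zone ID, health check ID, record name, and endpoint names below are hypothetical placeholders. The `run` wrapper prints the final command unless `DRY_RUN=0` is set.

```shell
# Print each command by default; set DRY_RUN=0 to execute against a real account.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

HOSTED_ZONE_ID="Z0HYPOTHETICAL"
PRIMARY_HEALTH_CHECK_ID="11111111-2222-3333-4444-555555555555"

# Write the change batch: same record name, different SetIdentifier and Failover role.
cat > failover-records.json <<EOF
{
  "Comment": "Active-passive failover for app.example.com",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "CNAME",
        "SetIdentifier": "primary",
        "Failover": "PRIMARY",
        "TTL": 60,
        "HealthCheckId": "$PRIMARY_HEALTH_CHECK_ID",
        "ResourceRecords": [{ "Value": "primary-alb.us-east-1.example.com" }]
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "CNAME",
        "SetIdentifier": "secondary",
        "Failover": "SECONDARY",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "dr-alb.us-west-2.example.com" }]
      }
    }
  ]
}
EOF

run aws route53 change-resource-record-sets \
  --hosted-zone-id "$HOSTED_ZONE_ID" \
  --change-batch file://failover-records.json
```

A short TTL (60 seconds here) keeps cached answers from pointing at the failed primary for long after the health check trips.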

4.2 Database DR#

| Strategy | Service | RPO | RTO |
|---|---|---|---|
| Cross-Region Snapshot Copy | RDS, Aurora | 1 day (snapshot schedule) | Hours |
| Cross-Region Read Replica | RDS, Aurora | < 5 seconds | Minutes |
| Aurora Global Database | Aurora | < 1 second | ~1 minute |
| DynamoDB Global Tables | DynamoDB | < 1 second | Seconds |

Aurora Global Database:

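# 1. Create the global cluster first; member clusters attach to it by identifier.
aws rds create-global-cluster \
  --global-cluster-identifier app-global-cluster \
  --engine aurora-mysql \
  --storage-encrypted \
  --region us-east-1

# 2. Create the primary cluster and attach it to the global cluster.
#    DB_MASTER_PASSWORD is an environment variable you supply; avoid literal passwords.
aws rds create-db-cluster \
  --engine aurora-mysql \
  --db-cluster-identifier app-global \
  --master-username admin \
  --master-user-password "$DB_MASTER_PASSWORD" \
  --global-cluster-identifier app-global-cluster \
  --storage-encrypted \
  --region us-east-1

# 3. Add a secondary cluster in another region (credentials come from the primary).
#    An encrypted global database needs a KMS key from the secondary region;
#    SECONDARY_KMS_KEY_ID is a placeholder for that key.
aws rds create-db-cluster \
  --engine aurora-mysql \
  --db-cluster-identifier app-secondary \
  --global-cluster-identifier app-global-cluster \
  --kms-key-id "$SECONDARY_KMS_KEY_ID" \
  --region eu-west-1

# Each cluster still needs DB instances (aws rds create-db-instance) to serve traffic.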

5. ⚡ Exam Tips#

  1. RTO/RPO: Driven by business needs, not technical capabilities
  2. Multi-AZ vs Multi-Region: Multi-AZ handles AZ failure, Multi-Region handles region failure
  3. Backup & Restore: Cheapest but highest RTO/RPO
  4. Pilot Light: Core services running, scale up when needed
  5. Warm Standby: Scaled-down version running, scale up on failover
  6. Multi-Site: Most expensive but lowest RTO/RPO
  7. Aurora Global DB: < 1 sec replication, ~1 min failover
  8. DynamoDB Global Tables: Multi-region active-active, < 1 sec replication

✅ Chapter Quiz#

  1. Which metric defines the maximum acceptable data loss in a disaster?

    • A) RTO
    • B) RPO
    • C) MTBF
    • D) MTTR
  2. Which DR strategy has the lowest cost but highest RTO/RPO?

    • A) Backup & Restore
    • B) Pilot Light
    • C) Warm Standby
    • D) Multi-Site
  3. What is the RPO of Aurora Global Database?

    • A) Under 1 second
    • B) 5 seconds
    • C) 1 minute
    • D) 1 hour
  4. Which AWS service provides Multi-AZ by default?

    • A) RDS (Single-AZ)
    • B) EC2
    • C) DynamoDB
    • D) ElastiCache
  5. In a Warm Standby DR strategy, what state is the DR environment usually in?

    • A) Fully shut down
    • B) Running at reduced capacity
    • C) Running at full capacity
    • D) Only DNS configured
  6. Which RDS feature provides automatic failover across Availability Zones?

    • A) Read Replicas
    • B) Multi-AZ
    • C) Global Database
    • D) Automated backups
  7. A company needs a DR solution with an RTO of less than 1 second and an RPO of near zero. Which strategy should they choose?

    • A) Backup & Restore
    • B) Pilot Light
    • C) Warm Standby
    • D) Multi-Site Active-Active
  8. What is the failover time for Aurora in an AZ failure scenario?

    • A) 60-120 seconds
    • B) ~30 seconds
    • C) ~10 seconds
    • D) Instant
  9. Which Route53 routing policy is used for active-passive failover?

    • A) Simple
    • B) Weighted
    • C) Failover
    • D) Latency
  10. A company needs to replicate an RDS database to another region with less than 5 seconds of lag. Which option should they use?

    • A) Cross-Region Snapshot Copy
    • B) Cross-Region Read Replica
    • C) RDS Multi-AZ
    • D) Database Migration Service
  11. Which DynamoDB feature enables multi-region active-active replication?

    • A) DAX
    • B) Global Tables
    • C) DynamoDB Streams
    • D) TTL
  12. What is the main advantage of AWS Global Accelerator in a multi-region architecture?

    • A) Caches static content at edge locations
    • B) Routes traffic to the optimal endpoint via the AWS global network
    • C) Provides DNS failover
    • D) Decouples application components
  13. In a Pilot Light DR strategy, which resources are typically running in the DR region?

    • A) Full production environment
    • B) Core data services (e.g., database replicas) with compute stopped
    • C) Only S3 backups
    • D) Nothing is running
  14. A company wants to protect against accidental deletion of an S3 object. Which feature should they enable?

    • A) Versioning
    • B) Lifecycle policies
    • C) Cross-region replication
    • D) Transfer Acceleration
  15. What is the purpose of an Elastic Load Balancer health check?

    • A) To monitor CPU utilization
    • B) To route traffic only to healthy instances
    • C) To store logs
    • D) To encrypt traffic
  16. Which DR strategy replicates your application in a scaled-down state in the DR region?

    • A) Backup & Restore
    • B) Pilot Light
    • C) Warm Standby
    • D) Multi-Site
  17. A company needs to automatically replace an unhealthy EC2 instance in an Auto Scaling group. What is this process called?

    • A) Scaling out
    • B) Health check replacement
    • C) Self-healing
    • D) Instance refresh
  18. How many copies of data does Aurora store across 3 Availability Zones?

    • A) 3
    • B) 6
    • C) 9
    • D) 2
  19. What is the primary purpose of an RDS read replica?

    • A) High availability with automatic failover
    • B) Offload read traffic from the primary database
    • C) Disaster recovery with zero RPO
    • D) Data encryption
  20. A company needs a recovery solution where data can be restored from daily snapshots. What is their likely RPO?

    • A) 1 hour
    • B) 15 minutes
    • C) 24 hours
    • D) Near zero
  21. Which AWS service provides DNS-level failover across multiple regions?

    • A) CloudFront
    • B) Route53
    • C) Global Accelerator
    • D) ALB
  22. What is the minimum number of EC2 instances you should run in an Auto Scaling group for high availability?

    • A) 1
    • B) 2
    • C) 3
    • D) 4
  23. A company's RDS database fails in an AZ outage with Multi-AZ enabled. What happens during failover?

    • A) A read replica is promoted to primary
    • B) The standby instance in another AZ becomes the new primary
    • C) A new instance is launched
    • D) The database is restored from snapshot
  24. What is the RPO of DynamoDB Global Tables?

    • A) Under 1 second
    • B) 5 seconds
    • C) 1 minute
    • D) 5 minutes
  25. Which architecture supports both a Pilot Light and Warm Standby approach depending on the size of the DR footprint?

    • A) Multi-AZ
    • B) Multi-Region active-passive
    • C) Multi-Region active-active
    • D) Single AZ
📝 Answer Key
  1. B: RPO defines the maximum acceptable data loss, measured as how far back in time you may have to recover.
  2. A: Backup & Restore is the cheapest strategy but has the highest RTO and RPO (hours).
  3. A: Aurora Global Database typically replicates with less than 1 second of lag.
  4. C: DynamoDB automatically replicates data across multiple AZs in a Region.
  5. B: Warm Standby runs at reduced capacity, ready to scale up.
  6. B: RDS Multi-AZ provides automatic failover to a standby in a different AZ.
  7. D: Multi-Site Active-Active provides near-zero RTO and RPO with full production in multiple regions.
  8. B: Aurora typically fails over in ~30 seconds; its storage keeps 6 copies across 3 AZs.
  9. C: The Route53 Failover routing policy directs traffic to a primary resource and falls back to a secondary when the primary is unhealthy.
  10. B: Cross-Region Read Replicas provide asynchronous replication with < 5 seconds of lag.
  11. B: DynamoDB Global Tables replicate data across regions with multi-region active-active support.
  12. B: Global Accelerator routes traffic to optimal endpoints over the AWS global network.
  13. B: Pilot Light keeps core data services (like DB replicas) running while compute instances stay stopped.
  14. A: S3 Versioning protects against accidental deletion by preserving previous object versions.
  15. B: Health checks monitor instance health so that traffic is routed only to healthy targets.
  16. C: Warm Standby runs a scaled-down version of production in the DR region.
  17. C: Self-healing automatically replaces unhealthy EC2 instances in an Auto Scaling group.
  18. B: Aurora stores 6 copies of data across 3 AZs for durability and availability.
  19. B: Read replicas offload read traffic by providing additional read-only database endpoints.
  20. C: Daily snapshots mean up to 24 hours of potential data loss (24-hour RPO).
  21. B: Route53 provides DNS failover routing with health checks for multi-region failover.
  22. B: At least 2 instances across 2 AZs ensure high availability in an Auto Scaling group.
  23. B: Multi-AZ promotes the standby instance to primary and updates the database endpoint's DNS record automatically.
  24. A: DynamoDB Global Tables typically replicate data across regions in under 1 second.
  25. B: Multi-Region active-passive supports both Pilot Light (minimal DR) and Warm Standby (scaled DR).
