High Availability & Disaster Recovery
Learning Objectives
- Design for high availability using Multi-AZ and Multi-Region strategies
- Implement disaster recovery patterns (backup, pilot light, warm standby, multi-site)
- Understand RTO and RPO and how to achieve them with AWS services
1. HA & DR Fundamentals
1.1 Key Metrics
| Metric | Description | Example |
|---|---|---|
| RTO (Recovery Time Objective) | Max acceptable downtime | 1 hour |
| RPO (Recovery Point Objective) | Max acceptable data loss | 15 minutes |
| MTBF | Mean time between failures | e.g. 10,000 hours; together with MTTR it determines availability |
| MTTR | Mean time to recovery | Automated failover in seconds |
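MTBF and MTTR combine into steady-state availability as MTBF / (MTBF + MTTR). The following is a minimal sketch of that arithmetic; the function names and the sample figures are illustrative assumptions, not values from AWS.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_minutes(availability_fraction: float) -> float:
    """Expected downtime per (non-leap) year for a given availability."""
    return (1.0 - availability_fraction) * MINUTES_PER_YEAR


# Hypothetical system: fails every ~1,000 hours, recovers in 6 minutes.
example = availability(mtbf_hours=999.9, mttr_hours=0.1)
print(f"availability: {example:.4%}")
print(f"downtime/year: {annual_downtime_minutes(example):.2f} min")
```

Shortening MTTR (for example through automated failover) raises availability without changing how often failures occur.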
graph LR
    subgraph Timeline["Disaster Recovery Timeline"]
        NOW["Normal Operation"]
        DISASTER["Disaster Strikes\nTimestamp: T+0"]
        RESTORE["System Restored\nTimestamp: T+RTO"]
    end
    DISASTER -->|RPO: Data loss window| LOSS["Last Backup\nData lost since backup"]
    DISASTER -->|RTO: Recovery Time| RESTORE
    style DISASTER fill:#d33,color:#fff
    style RESTORE fill:#1e8900,color:#fff
    style LOSS fill:#888,color:#fff
    NOW -->|Time passes| DISASTER

⚡ Exam Tip: Lower RTO/RPO = higher cost. Design according to business requirements, not technical perfection.
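The timeline above can be turned into a small check after a DR drill: the data-loss window is the gap between the last recovery point and the disaster, and the downtime is the gap between the disaster and the restore. This is a sketch under assumed names (`measure_recovery`, `meets_objectives`) and hypothetical timestamps.

```python
from datetime import datetime, timedelta
from typing import Tuple


def measure_recovery(
    last_recovery_point: datetime, disaster_at: datetime, restored_at: datetime
) -> Tuple[timedelta, timedelta]:
    """Return (data_loss_window, downtime) for one incident or drill."""
    if not (last_recovery_point <= disaster_at <= restored_at):
        raise ValueError("expected recovery point <= disaster <= restore")
    return disaster_at - last_recovery_point, restored_at - disaster_at


def meets_objectives(
    data_loss: timedelta, downtime: timedelta, rpo: timedelta, rto: timedelta
) -> bool:
    """True when the measured values are within the RPO and RTO targets."""
    return data_loss <= rpo and downtime <= rto


# Hypothetical drill: backup at 11:50, outage at 12:00, service back at 12:45.
loss, down = measure_recovery(
    datetime(2024, 1, 1, 11, 50),
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 1, 12, 45),
)
print(loss, down, meets_objectives(loss, down, timedelta(minutes=15), timedelta(hours=1)))
```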
1.2 DR Strategies Comparison
graph TD
    BACKUP["Backup & Restore\nRTO: Hours\nRPO: 24 hours\nCost: 💰\nUse: Dev/Test, non-critical"]
    PILOT["Pilot Light\nRTO: ~30 min\nRPO: ~15 min\nCost: 💰💰\nUse: Medium-critical apps"]
    WARM["Warm Standby\nRTO: ~5 min\nRPO: ~1 min\nCost: 💰💰💰\nUse: Production apps"]
    MULTI["Multi-Site Active-Active\nRTO: Near zero\nRPO: Near zero\nCost: 💰💰💰💰💰\nUse: Global zero-downtime"]
    BACKUP -->|Faster recovery + Higher cost| PILOT
    PILOT -->|Faster recovery + Higher cost| WARM
    WARM -->|Faster recovery + Higher cost| MULTI
    style BACKUP fill:#01ab5c,color:#fff
    style PILOT fill:#ff9900,color:#fff
    style WARM fill:#527fff,color:#fff
    style MULTI fill:#d33,color:#fff

Cost vs Recovery Trade-Off:
| Strategy | Cost | Complexity | RTO | RPO |
|---|---|---|---|---|
| Single AZ | Low | None | Hours | Hours |
| Multi-AZ | Medium | Low | Minutes | Minutes |
| Multi-Region (Active-Passive) | High | Medium | Minutes | Minutes |
| Multi-Region (Active-Active) | Highest | High | Seconds | Seconds |
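One way to apply the cost versus recovery trade-off is to pick the cheapest strategy whose RTO and RPO fit the business targets. The sketch below uses the approximate figures from the strategy diagram; the 8-hour RTO for Backup & Restore and the 0.1-minute stand-in for "near zero" are assumptions made only so the comparison is numeric.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    cost_rank: int        # relative cost, lower is cheaper
    rto_minutes: float
    rpo_minutes: float


STRATEGIES = [
    Strategy("Backup & Restore", 1, 8 * 60, 24 * 60),    # "hours" assumed as 8 h
    Strategy("Pilot Light", 2, 30, 15),
    Strategy("Warm Standby", 3, 5, 1),
    Strategy("Multi-Site Active-Active", 5, 0.1, 0.1),  # "near zero" assumed
]


def cheapest_strategy(rto_target_minutes: float, rpo_target_minutes: float) -> Optional[Strategy]:
    """Return the lowest-cost strategy that meets both targets, or None."""
    for strategy in sorted(STRATEGIES, key=lambda s: s.cost_rank):
        if (strategy.rto_minutes <= rto_target_minutes
                and strategy.rpo_minutes <= rpo_target_minutes):
            return strategy
    return None


choice = cheapest_strategy(rto_target_minutes=60, rpo_target_minutes=30)
print(choice.name if choice else "no strategy meets the targets")
```

Starting from the business targets and walking up the cost ladder mirrors the exam guidance: pay for a faster strategy only when the cheaper one misses a requirement.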
2. Multi-AZ HA Patterns
2.1 Compute HA
          Route53 (Health Check)
                     │
         ┌───────────┴───────────┐
         │                       │
  ┌──────┴──────┐         ┌──────┴──────┐
  │     ALB     │         │     ALB     │
  │ (us-east-1a)│         │ (us-east-1b)│
  └──────┬──────┘         └──────┬──────┘
         │                       │
  ┌──────┴──────┐         ┌──────┴──────┐
  │   EC2 x 2   │         │   EC2 x 2   │
  │    (ASG)    │         │    (ASG)    │
  └──────┬──────┘         └──────┬──────┘
         │                       │
         └───────────┬───────────┘
                     │
              ┌──────┴──────┐
              │ RDS Multi-AZ│
              │  (Primary)  │
              └─────────────┘

Key Services:
- EC2: Auto Scaling group spanning multiple AZs (minimum 2)
- ALB: Cross-zone load balancing distributes traffic across healthy targets in every enabled AZ
- RDS: Multi-AZ with a synchronous standby in another AZ
- ElastiCache: Redis with a replica in another AZ
- NAT Gateway: One per AZ, so losing an AZ does not break outbound traffic from the others
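Spreading an Auto Scaling group across AZs only helps if the surviving AZs can carry the full load. A common sizing rule (an assumption here, not something the list above states) is to provision each AZ so that the remaining AZs still meet demand after one is lost.

```python
import math


def instances_per_az(required_instances: int, az_count: int) -> int:
    """Instances to run in each AZ so that losing one AZ still leaves
    at least `required_instances` running in the remaining AZs."""
    if az_count < 2:
        raise ValueError("at least 2 AZs are needed to survive an AZ failure")
    if required_instances < 1:
        raise ValueError("required_instances must be at least 1")
    return math.ceil(required_instances / (az_count - 1))


# Hypothetical workload that needs 4 instances at peak.
for azs in (2, 3):
    per_az = instances_per_az(4, azs)
    print(f"{azs} AZs: {per_az} per AZ, {per_az * azs} total")
```

With two AZs the fleet is fully duplicated; adding a third AZ lowers the total overhead because the spare capacity is shared.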
2.2 Database HA
| Service | HA Mechanism | Failover Time |
|---|---|---|
| RDS Multi-AZ | Standby replica, sync replication | ~60-120 seconds |
| Aurora | 6 copies across 3 AZs, auto-failover | ~30 seconds |
| DynamoDB | Multi-AZ by default, auto-healing | Instant |
| ElastiCache Redis | Replication group, auto-failover | ~10-30 seconds |
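Because failover takes tens of seconds to a couple of minutes, clients usually need to retry connections while the new primary comes up. The following is a generic sketch of retry with exponential backoff; `call_with_retry`, its parameters, and the use of `ConnectionError` are illustrative assumptions rather than any AWS SDK behavior.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    max_wait_seconds: float = 120.0,
    base_delay: float = 1.0,
    max_delay: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation` on ConnectionError with exponential backoff.

    Gives up (re-raising the last error) once the next sleep would push the
    total time spent sleeping past `max_wait_seconds`. Time spent inside
    `operation` itself is not counted.
    """
    waited = 0.0
    delay = base_delay
    while True:
        try:
            return operation()
        except ConnectionError:
            if waited + delay > max_wait_seconds:
                raise
            sleep(delay)
            waited += delay
            delay = min(delay * 2, max_delay)
```

Setting `max_wait_seconds` slightly above the expected failover time from the table keeps clients from failing during a normal failover while still surfacing a real outage.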
3. Disaster Recovery Patterns
3.1 Backup & Restore (RTO: Hours, RPO: 24h)
Primary Region                        DR Region
┌──────────────┐                      ┌──────────────┐
│ Production   │  Daily snapshots     │ S3 Bucket    │
│ (us-east-1)  │─────────────────────>│ (us-west-2)  │
│              │  S3 Cross-Region     │ ┌──────────┐ │
│ RDS, EBS,    │  Replication         │ │ Snapshots│ │
│ S3 buckets   │                      │ └──────────┘ │
└──────────────┘                      └──────┬───────┘
                                             │
                                   (Restore when needed)
                                             │
                                        ┌────┴─────┐
                                        │ New Env  │
                                        │ (restore)│
                                        └──────────┘

Cost: Lowest (pay only for backups and storage)
Use Case: Non-critical apps, dev/test environments
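With snapshot-based recovery, the restore point is the newest snapshot taken before the failure, and the worst-case RPO is the snapshot interval plus any delay in copying the snapshot to the DR region. The helper names and the sample schedule below are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def latest_snapshot_before(
    snapshot_times: Iterable[datetime], failure_time: datetime
) -> Optional[datetime]:
    """Newest snapshot taken at or before the failure, or None if there is none."""
    candidates = [t for t in snapshot_times if t <= failure_time]
    return max(candidates) if candidates else None


def worst_case_rpo(snapshot_interval: timedelta, copy_lag: timedelta = timedelta(0)) -> timedelta:
    """Upper bound on data loss: a failure just before the next snapshot
    finishes copying loses one full interval plus the copy lag."""
    return snapshot_interval + copy_lag


# Hypothetical daily snapshots at 02:00, failure on day 3 at 01:00.
snapshots = [datetime(2024, 1, day, 2, 0) for day in (1, 2, 3)]
restore_point = latest_snapshot_before(snapshots, datetime(2024, 1, 3, 1, 0))
print(restore_point, worst_case_rpo(timedelta(hours=24), timedelta(minutes=30)))
```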
3.2 Pilot Light (RTO: ~30 min, RPO: ~15 min)
Primary Region                        DR Region
┌───────────────┐                     ┌───────────────┐
│ Production    │                     │ Pilot Light   │
│               │  Replicate data     │               │
│ EC2 (running) │────────────────────>│ RDS (running) │
│ RDS (active)  │                     │ EC2 (stopped) │
│               │                     │ AMI ready     │
│ Route53       │                     │ Route53 warm  │
└───────────────┘                     └───────┬───────┘
                                              │
                                     When disaster hits:
                                     1. Start EC2 instances
                                     2. Update Route53 DNS
                                     3. Scale up

Cost: Low-medium (smaller footprint in DR)
Use Case: Medium-critical apps, can tolerate ~30 min downtime
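The "Update Route53 DNS" step can be prepared ahead of time as a change batch that repoints the application record at the DR endpoint. This sketch only builds the JSON document; the record name, target, and function name are hypothetical, and a CNAME is used for simplicity (a record at the zone apex or one targeting an ALB would normally be an alias record instead).

```python
import json


def build_dns_cutover_batch(record_name: str, dr_target: str, ttl: int = 60) -> dict:
    """Build a Route53 change batch that repoints a CNAME at the DR endpoint."""
    return {
        "Comment": "Fail over to DR region",
        "Changes": [
            {
                "Action": "UPSERT",  # create the record or overwrite the existing one
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,  # a low TTL shortens how long clients cache the old target
                    "ResourceRecords": [{"Value": dr_target}],
                },
            }
        ],
    }


# Hypothetical names, for illustration only.
batch = build_dns_cutover_batch("app.example.com.", "dr-alb.us-west-2.example.com")
print(json.dumps(batch, indent=2))
```

The resulting JSON could be saved to a file and passed to `aws route53 change-resource-record-sets` through its `--change-batch` option, together with the hosted zone ID.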
3.3 Warm Standby (RTO: ~5 min, RPO: ~1 min)
Primary Region                        DR Region
┌──────────────┐                      ┌──────────────────┐
│ Full Prod    │                      │ Scaled-down Prod │
│              │  Continuous          │                  │
│ EC2 (100%)   │─────────────────────>│ EC2 (25%)        │
│ RDS (write)  │  Replication         │ RDS (read-only)  │
│ ALB          │                      │ ALB (routing)    │
└──────────────┘                      └────────┬─────────┘
                                               │
                                      When disaster hits:
                                      1. Scale up EC2
                                      2. Promote RDS to write
                                      3. Shift Route53 traffic

Cost: Medium-high (running scaled-down environment)
Use Case: Production apps, business-critical
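The "Scale up EC2" step comes down to simple arithmetic: the standby already runs a fraction of production, so failover must add the difference. A minimal sketch, assuming a hypothetical `scale_up_plan` helper and a 20-instance production fleet:

```python
import math
from typing import Tuple


def scale_up_plan(full_capacity: int, standby_fraction: float) -> Tuple[int, int]:
    """Return (instances already running in DR, instances to add on failover)."""
    if full_capacity < 1:
        raise ValueError("full_capacity must be at least 1")
    if not 0 < standby_fraction <= 1:
        raise ValueError("standby_fraction must be in (0, 1]")
    running = math.ceil(full_capacity * standby_fraction)
    return running, full_capacity - running


# Hypothetical fleet: production needs 20 instances, standby runs at 25%.
running, to_add = scale_up_plan(20, 0.25)
print(f"standby runs {running}, failover adds {to_add}")
```

The number of instances to add, and how long they take to launch and pass health checks, is a large part of the warm standby RTO.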
3.4 Multi-Site Active-Active (RTO: Near 0, RPO: Near 0)
┌───────────────────┐           ┌───────────────────┐
│ Region A          │           │ Region B          │
│ (us-east-1)       │           │ (us-west-2)       │
│                   │           │                   │
│ Route53 (Latency) │<─────────>│ Route53 (Latency) │
│                   │           │                   │
│ ALB → EC2 → RDS   │           │ ALB → EC2 → RDS   │
│ DynamoDB Global   │<─────────>│ DynamoDB Global   │
│ Table             │           │ Table             │
└───────────────────┘           └───────────────────┘

Cost: Highest (full production in 2+ regions)
Use Case: Global apps, zero-downtime requirements
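In an active-active setup, latency-based routing with health checks effectively sends each user to the closest healthy region. The following is a simplified local model of that decision, not Route53's actual implementation; the class, function, and latency figures are assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class RegionEndpoint:
    region: str
    latency_ms: float   # latency observed from the user's location
    healthy: bool       # result of the endpoint's health check


def pick_region(endpoints: Iterable[RegionEndpoint]) -> Optional[str]:
    """Choose the healthy region with the lowest latency, or None if all are down."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e.latency_ms).region


# Hypothetical user near us-east-1; that region then fails its health check.
normal = [RegionEndpoint("us-east-1", 20, True), RegionEndpoint("us-west-2", 80, True)]
outage = [RegionEndpoint("us-east-1", 20, False), RegionEndpoint("us-west-2", 80, True)]
print(pick_region(normal), pick_region(outage))
```

Because the second region is already serving traffic, an outage only changes which region a user lands in, which is why RTO is near zero for this pattern.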
4. DR with AWS Services
4.1 DNS Failover (Route53)
# Primary health check
aws route53 create-health-check \
--caller-reference "primary-app-$(date +%s)" \
--health-check-config '{"Type": "HTTPS", "FullyQualifiedDomainName": "app.primary.com", "Port": 443, "RequestInterval": 10, "FailureThreshold": 3 }'
# Create failover record sets pointing to primary and secondary

4.2 Database DR
| Strategy | Service | RPO | RTO |
|---|---|---|---|
| Cross-Region Snapshot Copy | RDS, Aurora | 1 day (snapshot schedule) | Hours |
| Cross-Region Read Replica | RDS, Aurora | Typically < 5 seconds (asynchronous; varies with replica lag) | Minutes |
| Aurora Global Database | Aurora | < 1 second | ~1 minute |
| DynamoDB Global Tables | DynamoDB | < 1 second | Seconds |
Aurora Global Database:
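# Create the global cluster first; member clusters can only join an existing one
aws rds create-global-cluster \
--global-cluster-identifier app-global-cluster \
--engine aurora-mysql \
--storage-encrypted \
--region us-east-1
# Create the primary cluster (read the password from the environment, do not hardcode it)
aws rds create-db-cluster \
--engine aurora-mysql \
--db-cluster-identifier app-global \
--master-username admin \
--master-user-password "$DB_MASTER_PASSWORD" \
--global-cluster-identifier app-global-cluster \
--storage-encrypted \
--region us-east-1
# Add secondary region (an encrypted secondary typically also needs --kms-key-id for a key in that region)
aws rds create-db-cluster \
--engine aurora-mysql \
--db-cluster-identifier app-secondary \
--global-cluster-identifier app-global-cluster \
--source-region us-east-1 \
--region eu-west-1

5. ⚡ Exam Tips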
- RTO/RPO: Driven by business needs, not technical capabilities
- Multi-AZ vs Multi-Region: Multi-AZ handles AZ failure, Multi-Region handles region failure
- Backup & Restore: Cheapest but highest RTO/RPO
- Pilot Light: Core services running, scale up when needed
- Warm Standby: Scaled-down version running, scale up on failover
- Multi-Site: Most expensive but lowest RTO/RPO
- Aurora Global DB: < 1 sec replication, ~1 min failover
- DynamoDB Global Tables: Multi-region active-active, < 1 sec replication
Chapter Quiz
1. Which metric defines the maximum acceptable data loss in a disaster?
   - A) RTO
   - B) RPO
   - C) MTBF
   - D) MTTR
2. Which DR strategy has the lowest cost but highest RTO/RPO?
   - A) Backup & Restore
   - B) Pilot Light
   - C) Warm Standby
   - D) Multi-Site
3. What is the typical RPO of Aurora Global Database?
   - A) Under 1 second
   - B) 5 seconds
   - C) 1 minute
   - D) 1 hour
4. Which AWS service provides Multi-AZ by default?
   - A) RDS (Single-AZ)
   - B) EC2
   - C) DynamoDB
   - D) ElastiCache
5. In a Warm Standby DR strategy, what state is the DR environment usually in?
   - A) Fully shut down
   - B) Running at reduced capacity
   - C) Running at full capacity
   - D) Only DNS configured
6. Which RDS feature provides automatic failover across Availability Zones?
   - A) Read Replicas
   - B) Multi-AZ
   - C) Global Database
   - D) Automated backups
7. A company needs a DR solution with an RTO of less than 1 second and an RPO of near zero. Which strategy should they choose?
   - A) Backup & Restore
   - B) Pilot Light
   - C) Warm Standby
   - D) Multi-Site Active-Active
8. What is the failover time for Aurora in an AZ failure scenario?
   - A) 60-120 seconds
   - B) ~30 seconds
   - C) ~10 seconds
   - D) Instant
9. Which Route53 routing policy is used for active-passive failover?
   - A) Simple
   - B) Weighted
   - C) Failover
   - D) Latency
10. A company needs to replicate an RDS database to another region with less than 5 seconds of lag. Which option should they use?
    - A) Cross-Region Snapshot Copy
    - B) Cross-Region Read Replica
    - C) RDS Multi-AZ
    - D) Database Migration Service
11. Which DynamoDB feature enables multi-region active-active replication?
    - A) DAX
    - B) Global Tables
    - C) DynamoDB Streams
    - D) TTL
12. What is the main advantage of AWS Global Accelerator in a multi-region architecture?
    - A) Caches static content at edge locations
    - B) Routes traffic to the optimal endpoint via the AWS global network
    - C) Provides DNS failover
    - D) Decouples application components
13. In a Pilot Light DR strategy, which resources are typically running in the DR region?
    - A) Full production environment
    - B) Core data services (e.g., database replicas) with compute stopped
    - C) Only S3 backups
    - D) Nothing is running
14. A company wants to protect against accidental deletion of an S3 object. Which feature should they enable?
    - A) Versioning
    - B) Lifecycle policies
    - C) Cross-region replication
    - D) Transfer Acceleration
15. What is the purpose of an Elastic Load Balancer health check?
    - A) To monitor CPU utilization
    - B) To route traffic only to healthy instances
    - C) To store logs
    - D) To encrypt traffic
16. Which DR strategy replicates your application in a scaled-down state in the DR region?
    - A) Backup & Restore
    - B) Pilot Light
    - C) Warm Standby
    - D) Multi-Site
17. A company needs to automatically replace an unhealthy EC2 instance in an Auto Scaling group. What is this process called?
    - A) Scaling out
    - B) Health check replacement
    - C) Self-healing
    - D) Instance refresh
18. How many copies of data does Aurora store across 3 Availability Zones?
    - A) 3
    - B) 6
    - C) 9
    - D) 2
19. What is the primary purpose of an RDS read replica?
    - A) High availability with automatic failover
    - B) Offload read traffic from the primary database
    - C) Disaster recovery with zero RPO
    - D) Data encryption
20. A company needs a recovery solution where data can be restored from daily snapshots. What is their likely RPO?
    - A) 1 hour
    - B) 15 minutes
    - C) 24 hours
    - D) Near zero
21. Which AWS service provides DNS-level failover across multiple regions?
    - A) CloudFront
    - B) Route53
    - C) Global Accelerator
    - D) ALB
22. What is the minimum number of EC2 instances you should run in an Auto Scaling group for high availability?
    - A) 1
    - B) 2
    - C) 3
    - D) 4
23. A company's RDS database fails in an AZ outage with Multi-AZ enabled. What happens during failover?
    - A) A read replica is promoted to primary
    - B) The standby instance in another AZ becomes the new primary
    - C) A new instance is launched
    - D) The database is restored from snapshot
24. What is the RPO of DynamoDB Global Tables?
    - A) Under 1 second
    - B) 5 seconds
    - C) 1 minute
    - D) 5 minutes
25. Which architecture supports both a Pilot Light and Warm Standby approach depending on the size of the DR footprint?
    - A) Multi-AZ
    - B) Multi-Region active-passive
    - C) Multi-Region active-active
    - D) Single AZ
Answer Key
1. B: RPO defines the maximum acceptable data loss (how far back in time you can lose data).
2. A: Backup & Restore is the cheapest option, but its RTO and RPO are measured in hours.
3. A: Aurora Global Database typically has < 1 second replication lag.
4. C: DynamoDB automatically replicates data across 3 AZs.
5. B: Warm Standby runs at reduced capacity, ready to scale up.
6. B: RDS Multi-AZ provides automatic failover to a standby in a different AZ.
7. D: Multi-Site Active-Active provides near-zero RTO and RPO with full production in multiple regions.
8. B: Aurora automatically fails over in ~30 seconds using 6 copies across 3 AZs.
9. C: The Route53 Failover routing policy directs traffic to a primary resource and fails over to a secondary.
10. B: Cross-Region Read Replicas provide asynchronous replication, typically with < 5 seconds of lag.
11. B: DynamoDB Global Tables replicate data across regions with multi-region active-active support.
12. B: Global Accelerator routes traffic to optimal endpoints over the AWS global network.
13. B: Pilot Light keeps core data services (such as DB replicas) running while compute instances stay stopped.
14. A: S3 Versioning protects against accidental deletion by preserving previous object versions.
15. B: Health checks monitor instance health so that traffic is routed only to healthy targets.
16. C: Warm Standby runs a scaled-down version of production in the DR region.
17. C: Self-healing automatically replaces unhealthy EC2 instances in an Auto Scaling group.
18. B: Aurora stores 6 copies of data across 3 AZs for durability and availability.
19. B: Read replicas offload read traffic by providing additional read-only database endpoints.
20. C: Daily snapshots mean up to 24 hours of potential data loss (24-hour RPO).
21. B: Route53 provides DNS failover routing with health checks for multi-region failover.
22. B: At least 2 instances across 2 AZs ensure high availability in an Auto Scaling group.
23. B: Multi-AZ promotes the standby instance to primary and updates DNS automatically.
24. A: DynamoDB Global Tables typically replicate data across regions in under 1 second.
25. B: Multi-Region active-passive supports both Pilot Light (minimal DR) and Warm Standby (scaled DR).
Additional Resources
- DR on AWS Whitepaper
- Well-Architected Reliability Pillar
- Aurora Global Database
- Route53 Routing Policies