🔄 High Availability & Disaster Recovery

Learning Objectives

Design for high availability using Multi-AZ and Multi-Region strategies
Implement disaster recovery patterns (backup and restore, pilot light, warm standby, multi-site)
Understand RTO and RPO and how to meet them with AWS services

1. HA & DR Fundamentals

1.1 Key Metrics

Metric	Description	Example
RTO (Recovery Time Objective)	Max acceptable downtime	1 hour
RPO (Recovery Point Objective)	Max acceptable data loss, measured in time	15 minutes
MTBF (Mean Time Between Failures)	Average operating time between failures	1,000 hours
MTTR (Mean Time To Recovery)	Average time to restore service after a failure	Automated failover in seconds
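
MTBF and MTTR combine into an availability figure through the standard steady-state relation availability = MTBF / (MTBF + MTTR). The sketch below is a minimal illustration of that arithmetic; the function names and the sample numbers are assumptions chosen for the example, not values from AWS.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: uptime divided by total time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected yearly downtime implied by an availability fraction."""
    return (1.0 - avail) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical system: fails every 999 hours, takes 1 hour to recover.
    a = availability(mtbf_hours=999.0, mttr_hours=1.0)
    print(f"availability: {a:.4%}")
    print(f"downtime per year: {downtime_minutes_per_year(a):.1f} minutes")
```

Shrinking MTTR (for example through automated failover) raises availability even when the failure rate stays the same, which is why most HA patterns in this chapter focus on fast recovery.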

graph LR
    subgraph Timeline["Disaster Recovery Timeline"]
        NOW["🟢 Normal Operation"]
        DISASTER["💥 Disaster Strikes\nTimestamp: T+0"]
        RESTORE["✅ System Restored\nTimestamp: T+RTO"]
    end

    DISASTER -->|RPO: Data loss window| LOSS["📉 Last Backup\nData lost since backup"]
    DISASTER -->|RTO: Recovery Time| RESTORE

    style DISASTER fill:#d33,color:#fff
    style RESTORE fill:#1e8900,color:#fff
    style LOSS fill:#888,color:#fff

    NOW -->|Time passes| DISASTER

⚡ Exam Tip: Lower RTO/RPO = higher cost. Design according to business requirements, not technical perfection.
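
To make the timeline concrete, the following sketch checks a single incident against RTO and RPO targets. It assumes three timestamps are known (last good backup, disaster, service restored); the function name and the example times are hypothetical.

```python
from datetime import datetime, timedelta


def check_objectives(last_backup: datetime, disaster: datetime,
                     restored: datetime, rpo: timedelta, rto: timedelta) -> dict:
    """Compare actual data loss and downtime with the RPO and RTO targets."""
    data_loss = disaster - last_backup   # work done since the last backup is lost
    downtime = restored - disaster       # time the system was unavailable
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


if __name__ == "__main__":
    result = check_objectives(
        last_backup=datetime(2024, 1, 1, 2, 0),
        disaster=datetime(2024, 1, 1, 2, 10),
        restored=datetime(2024, 1, 1, 3, 40),
        rpo=timedelta(minutes=15),
        rto=timedelta(hours=1),
    )
    print(result)
```

In this made-up incident the 10 minutes of lost data fit inside a 15-minute RPO, but the 90 minutes of downtime miss a 1-hour RTO, so the two objectives have to be evaluated separately.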

1.2 DR Strategies Comparison

graph TD
    BACKUP["💾 Backup & Restore\nRTO: Hours\nRPO: 24 hours\nCost: 💰\nUse: Dev/Test, non-critical"]
    
    PILOT["🔦 Pilot Light\nRTO: ~30 min\nRPO: ~15 min\nCost: 💰💰\nUse: Medium-critical apps"]
    
    WARM["🔥 Warm Standby\nRTO: ~5 min\nRPO: ~1 min\nCost: 💰💰💰\nUse: Production apps"]
    
    MULTI["🌍 Multi-Site Active-Active\nRTO: Near zero\nRPO: Near zero\nCost: 💰💰💰💰💰\nUse: Global zero-downtime"]

    BACKUP -->|Faster recovery + Higher cost| PILOT
    PILOT -->|Faster recovery + Higher cost| WARM
    WARM -->|Faster recovery + Higher cost| MULTI

    style BACKUP fill:#01ab5c,color:#fff
    style PILOT fill:#ff9900,color:#fff
    style WARM fill:#527fff,color:#fff
    style MULTI fill:#d33,color:#fff

Cost vs Recovery Trade-Off:

Strategy	Cost	Complexity	RTO	RPO
Single AZ	Low	None	Hours	Hours
Multi-AZ	Medium	Low	Minutes	Minutes
Multi-Region (Active-Passive)	High	Medium	Minutes	Minutes
Multi-Region (Active-Active)	Highest	High	Seconds	Seconds
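
The "design to business requirements" rule can be sketched as picking the cheapest strategy whose RTO and RPO fit the targets. The figures below follow the DR strategy comparison in this section, with two labeled assumptions: "Hours" for Backup & Restore RTO is modelled as 4 hours, and "near zero" is modelled as 0 minutes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float
    rpo_minutes: float


# Ordered cheapest first. Values are approximations for illustration only.
STRATEGIES = [
    Strategy("Backup & Restore", rto_minutes=240, rpo_minutes=24 * 60),  # RTO assumed 4 h
    Strategy("Pilot Light", rto_minutes=30, rpo_minutes=15),
    Strategy("Warm Standby", rto_minutes=5, rpo_minutes=1),
    Strategy("Multi-Site Active-Active", rto_minutes=0, rpo_minutes=0),  # "near zero"
]


def cheapest_strategy(rto_target_min: float, rpo_target_min: float) -> Optional[Strategy]:
    """Return the first (cheapest) strategy that satisfies both targets."""
    for strategy in STRATEGIES:
        if strategy.rto_minutes <= rto_target_min and strategy.rpo_minutes <= rpo_target_min:
            return strategy
    return None


if __name__ == "__main__":
    choice = cheapest_strategy(rto_target_min=60, rpo_target_min=15)
    print(choice.name if choice else "no strategy fits")
```

Because the list is ordered by cost, the first match is also the least expensive option, which mirrors the exam guidance of not over-engineering beyond the stated objectives.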

2. Multi-AZ HA Patterns

2.1 Compute HA

                            Route53 (Health Check)
                                  │
                    ┌─────────────┴─────────────┐
                    │                           │
              ┌─────┴─────┐              ┌─────┴─────┐
              │   ALB     │              │   ALB     │
              │ us-east-1a│              │ us-east-1b│
              └─────┬─────┘              └─────┬─────┘
                    │                          │
              ┌─────┴─────┐              ┌─────┴─────┐
              │ EC2 x 2   │              │ EC2 x 2   │
              │ (ASG)     │              │ (ASG)     │
              └───────────┘              └───────────┘
                    │                          │
                    └──────────┬───────────────┘
                               │
                        ┌──────┴──────┐
                        │ RDS Multi-AZ│
                        │ (Primary)   │
                        └─────────────┘

Key Services:

EC2 — Auto Scaling across multiple AZs (min 2 AZs)
ALB — One load balancer with nodes in every enabled AZ; cross-zone load balancing spreads traffic across healthy targets in all AZs
RDS — Multi-AZ with standby in another AZ
ElastiCache — Redis with replica in another AZ
NAT Gateway — One per AZ, with each AZ's private subnets routing through their local gateway
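
Spreading an Auto Scaling group across AZs only helps if the surviving AZs can carry the load on their own. The sketch below shows the sizing arithmetic under the assumption that capacity is pre-provisioned rather than launched during the outage; the function name is hypothetical.

```python
import math


def instances_per_az(required: int, az_count: int, az_failures: int = 1) -> int:
    """Instances to run in each AZ so that `required` instances remain
    after `az_failures` AZs are lost."""
    surviving = az_count - az_failures
    if surviving < 1:
        raise ValueError("at least one AZ must survive")
    return math.ceil(required / surviving)


if __name__ == "__main__":
    for azs in (2, 3):
        per_az = instances_per_az(required=4, az_count=azs)
        print(f"{azs} AZs: {per_az} per AZ, {per_az * azs} total")
```

With two AZs the fleet must be doubled to tolerate an AZ loss, while three AZs need only 50% extra capacity, which is one reason three-AZ deployments are often cheaper for the same resilience.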

2.2 Database HA

Service	HA Mechanism	Failover Time
RDS Multi-AZ	Standby replica, sync replication	~60-120 seconds
Aurora	6 copies across 3 AZs, auto-failover	~30 seconds
DynamoDB	Multi-AZ by default, auto-healing	No failover step (transparent to clients)
ElastiCache Redis	Replication group, auto-failover	~10-30 seconds
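
Failover times in this range mean application clients see connection errors for tens of seconds, so they need to retry rather than fail. The following is a minimal sketch of retrying with exponential backoff until a deadline; `call_with_retry` and its parameters are illustrative assumptions, not part of any AWS SDK.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(operation: Callable[[], T],
                    max_wait_seconds: float = 120.0,
                    base_delay: float = 1.0,
                    max_delay: float = 15.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> T:
    """Retry `operation` on ConnectionError with exponential backoff.

    Gives up (re-raising the last error) once the next wait would pass
    the deadline, which should be longer than the expected failover time.
    """
    deadline = clock() + max_wait_seconds
    delay = base_delay
    while True:
        try:
            return operation()
        except ConnectionError:
            if clock() + delay > deadline:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)  # 1s, 2s, 4s, ... capped


if __name__ == "__main__":
    attempts = {"n": 0}

    def query() -> str:
        # Stand-in for a database call that fails while failover is in progress.
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise ConnectionError("failover in progress")
        return "row data"

    print(call_with_retry(query, base_delay=0.01))
```

The `sleep` and `clock` parameters are injected so the logic can be tested without real waiting; production code would leave them at their defaults.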

3. Disaster Recovery Patterns

3.1 Backup & Restore (RTO: Hours, RPO: 24h)

Primary Region                          DR Region
┌──────────────┐                   ┌──────────────┐
│  Production   │  Daily snapshots  │   S3 Bucket  │
│  (us-east-1)  │──────────────────>│ (us-west-2)  │
│               │  S3 Cross-Region │   ┌────────┐ │
│  RDS, EBS,    │  Replication     │   │ Snapshots│ │
│  S3 buckets   │                  │   └────────┘ │
└──────────────┘                   └──────────────┘
                                        │
                                   (Restore when needed)
                                        │
                                   ┌────┴────┐
                                   │ New Env  │
                                   │ (restore)│
                                   └─────────┘

Cost: Lowest (pay only for backups and storage)
Use Case: Non-critical apps, dev/test environments

3.2 Pilot Light (RTO: ~30 min, RPO: ~15 min)

Primary Region                          DR Region
┌──────────────┐                   ┌──────────────┐
│  Production   │                   │   Pilot Light │
│               │  Replicate data   │               │
│  EC2 (running)│──────────────────>│ RDS (running)  │
│  RDS (active) │                   │ EC2 (stopped)  │
│               │                   │ AMI ready      │
│  Route53      │                   │ Route53 warm   │
└──────────────┘                   └──────────────┘
                                        │
                                  When disaster hits:
                                  1. Start EC2 instances
                                  2. Update Route53 DNS
                                  3. Scale up

Cost: Low-medium (smaller footprint in DR)
Use Case: Medium-critical apps that can tolerate ~30 min of downtime
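
The three failover steps for Pilot Light are order-dependent, so they are usually scripted as a runbook. Below is a hedged sketch of such a runner: it executes steps in order, stops at the first failure, and reports whether the elapsed time fits the RTO. The step actions are placeholders; in practice each would wrap a real API call.

```python
import time
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_failover(steps: List[Step], rto_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> dict:
    """Run failover steps in order and compare elapsed time with the RTO.

    An exception from any step propagates immediately, so later steps
    (such as the DNS switch) never run against a half-built environment.
    """
    start = clock()
    completed = []
    for name, action in steps:
        action()
        completed.append(name)
    elapsed = clock() - start
    return {
        "completed": completed,
        "elapsed_seconds": elapsed,
        "within_rto": elapsed <= rto_seconds,
    }


if __name__ == "__main__":
    # Placeholder actions; real ones would start instances, update DNS, etc.
    steps = [
        ("start EC2 instances", lambda: None),
        ("update Route 53 DNS", lambda: None),
        ("scale up", lambda: None),
    ]
    print(run_failover(steps, rto_seconds=30 * 60))
```

Timing the runbook during regular DR drills is a practical way to confirm that the ~30 minute RTO is actually achievable.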

3.3 Warm Standby (RTO: ~5 min, RPO: ~1 min)

Primary Region                          DR Region
┌──────────────┐                   ┌──────────────────┐
│  Full Prod   │                   │  Scaled-down Prod │
│               │  Continuous      │                   │
│  EC2 (100%)  │──────────────────>│  EC2 (25%)        │
│  RDS (write)  │  Replication     │  RDS (read-only)  │
│  ALB          │                  │  ALB (routing)     │
└──────────────┘                   └──────────────────┘
                                        │
                                  When disaster:
                                  1. Scale up EC2
                                  2. Promote RDS to write
                                  3. Shift Route53 traffic

Cost: Medium-high (running a scaled-down environment)
Use Case: Business-critical production apps
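
The "scale up EC2" step amounts to closing the gap between the standby fleet and full production size. A small sketch of that calculation follows, assuming the 25% standby fraction from the diagram; the function name and the instance counts are hypothetical.

```python
import math


def warm_standby_plan(prod_instances: int, standby_fraction: float = 0.25) -> dict:
    """How many instances run in DR normally and how many to add on failover."""
    if not 0 < standby_fraction <= 1:
        raise ValueError("standby_fraction must be in (0, 1]")
    running = math.ceil(prod_instances * standby_fraction)
    return {
        "running_in_dr": running,
        "to_launch_on_failover": prod_instances - running,
    }


if __name__ == "__main__":
    print(warm_standby_plan(prod_instances=8))
```

The larger the standby fraction, the fewer instances must launch during the incident, which shortens RTO at the price of a higher steady-state bill.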

3.4 Multi-Site Active-Active (RTO: Near 0, RPO: Near 0)

┌──────────────────┐             ┌──────────────────┐
│  Region A         │             │  Region B         │
│  (us-east-1)      │             │  (us-west-2)      │
│                   │             │                   │
│  Route53 (Latency)│─────────────│ Route53 (Latency) │
│                   │             │                   │
│  ALB → EC2 → RDS │             │ ALB → EC2 → RDS  │
│  DynamoDB Global  │─────────────│ DynamoDB Global  │
│  Table            │             │ Table            │
└──────────────────┘             └──────────────────┘

Cost: Highest (full production in 2+ regions)
Use Case: Global apps, zero-downtime requirements
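
Latency-based routing with health checks can be thought of as "lowest latency among healthy regions". The sketch below models that decision in plain Python; it is a simplified illustration of the idea, not how Route 53 is implemented, and the latency numbers are invented.

```python
from typing import Dict, Optional, Set


def pick_region(latency_ms: Dict[str, float], healthy: Set[str]) -> Optional[str]:
    """Return the healthy region with the lowest measured latency, if any."""
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        return None
    return min(candidates, key=lambda region: candidates[region])


if __name__ == "__main__":
    latencies = {"us-east-1": 20.0, "us-west-2": 80.0}  # hypothetical client view
    print(pick_region(latencies, healthy={"us-east-1", "us-west-2"}))
    print(pick_region(latencies, healthy={"us-west-2"}))  # us-east-1 is down
```

When a region fails its health check it simply drops out of the candidate set, so users are served by the next-best region without any promotion step.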

4. DR with AWS Services

4.1 DNS Failover (Route53)

# Primary health check
aws route53 create-health-check \
  --caller-reference "primary-app-$(date +%s)" \
  --health-check-config '{"Type": "HTTPS", "FullyQualifiedDomainName": "app.primary.com", "Port": 443, "RequestInterval": 10, "FailureThreshold": 3 }'

# Create failover record sets pointing to primary and secondary
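
The failover record sets mentioned above can be described as a Route 53 change batch. The sketch below builds that JSON document in Python; the domain name, IP addresses (taken from documentation ranges), and health check ID are placeholders, and the failover time estimate is a rough approximation that only counts health check detection plus DNS TTL.

```python
import json
from typing import Optional


def failover_record(name: str, ip: str, role: str, set_id: str,
                    health_check_id: Optional[str] = None, ttl: int = 60) -> dict:
    """Build one UPSERT change for a failover A record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name: str, primary_ip: str, secondary_ip: str,
                          health_check_id: str) -> dict:
    """Primary record tied to a health check, secondary as the fallback."""
    return {
        "Comment": "Active-passive failover records",
        "Changes": [
            failover_record(name, primary_ip, "PRIMARY", "primary", health_check_id),
            failover_record(name, secondary_ip, "SECONDARY", "secondary"),
        ],
    }


def approx_failover_seconds(request_interval: int, failure_threshold: int, ttl: int) -> int:
    """Rough estimate: time to mark the endpoint unhealthy plus cached DNS TTL."""
    return request_interval * failure_threshold + ttl


if __name__ == "__main__":
    batch = failover_change_batch(
        "app.example.com.", "203.0.113.10", "198.51.100.10",
        health_check_id="abcdef11-2222-3333-4444-555555fedcba",  # placeholder ID
    )
    print(json.dumps(batch, indent=2))
    print(approx_failover_seconds(request_interval=10, failure_threshold=3, ttl=60))
```

Saved to a file, the JSON could be passed to `aws route53 change-resource-record-sets` with `--change-batch file://failover.json` and the ID of your hosted zone. A short TTL matters here because resolvers keep serving the old answer until it expires.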

4.2 Database DR

Strategy	Service	RPO	RTO
Cross-Region Snapshot Copy	RDS, Aurora	1 day (snapshot schedule)	Hours
Cross-Region Read Replica	RDS, Aurora	Typically < 5 seconds (asynchronous lag)	Minutes (replica must be promoted)
Aurora Global Database	Aurora	< 1 second	~1 minute
DynamoDB Global Tables	DynamoDB	< 1 second	Seconds

Aurora Global Database:

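# 1. Create the global cluster that the regional clusters attach to
aws rds create-global-cluster \
  --global-cluster-identifier app-global-cluster \
  --engine aurora-mysql \
  --storage-encrypted \
  --region us-east-1

# 2. Create the primary cluster inside the global cluster
#    (read the password from the environment instead of hard-coding it)
aws rds create-db-cluster \
  --engine aurora-mysql \
  --db-cluster-identifier app-global \
  --master-username admin \
  --master-user-password "$DB_MASTER_PASSWORD" \
  --global-cluster-identifier app-global-cluster \
  --storage-encrypted \
  --region us-east-1

# 3. Add a secondary cluster in another region. An encrypted global
#    database needs a KMS key from the secondary region; DR_KMS_KEY_ID
#    is a placeholder for that key.
aws rds create-db-cluster \
  --engine aurora-mysql \
  --db-cluster-identifier app-secondary \
  --global-cluster-identifier app-global-cluster \
  --kms-key-id "$DR_KMS_KEY_ID" \
  --source-region us-east-1 \
  --region eu-west-1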
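
Once a secondary region exists, its replication lag is what determines the RPO you actually get. The sketch below checks lag samples against an RPO target. It assumes the samples are in milliseconds, as reported by the CloudWatch metric `AuroraGlobalDBReplicationLag`; the sample values are invented and fetching them is out of scope here.

```python
from typing import Iterable, List


def rpo_breaches(lag_samples_ms: Iterable[float], rpo_seconds: float) -> List[float]:
    """Return the lag samples (in ms) that exceed the RPO target."""
    limit_ms = rpo_seconds * 1000.0
    return [lag for lag in lag_samples_ms if lag > limit_ms]


if __name__ == "__main__":
    samples = [120.0, 450.0, 1500.0, 800.0]  # hypothetical lag readings in ms
    breaches = rpo_breaches(samples, rpo_seconds=1.0)
    print(f"{len(breaches)} sample(s) above the 1 s RPO: {breaches}")
```

A check like this is typically wired to an alarm so that a sustained breach is noticed before a failover is needed, not discovered during one.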

5. ⚡ Exam Tips

RTO/RPO — Driven by business needs, not technical capabilities
Multi-AZ vs Multi-Region — Multi-AZ handles AZ failure, Multi-Region handles region failure
Backup & Restore — Cheapest but highest RTO/RPO
Pilot Light — Core data services running with compute switched off; start and scale up on failover
Warm Standby — Scaled-down version running, scale up on failover
Multi-Site — Most expensive but lowest RTO/RPO
Aurora Global DB — < 1 sec replication, ~1 min failover
DynamoDB Global Tables — Multi-region active-active, < 1 sec replication

✅ Chapter Quiz

1. Which metric defines the maximum acceptable data loss in a disaster?
- A) RTO
- B) RPO
- C) MTBF
- D) MTTR
2. Which DR strategy has the lowest cost but highest RTO/RPO?
- A) Backup & Restore
- B) Pilot Light
- C) Warm Standby
- D) Multi-Site
3. What is the RPO of Aurora Global Database?
- A) Typically under 1 second
- B) 5 seconds
- C) 1 minute
- D) 1 hour
4. Which AWS service provides Multi-AZ by default?
- A) RDS (Single-AZ)
- B) EC2
- C) DynamoDB
- D) ElastiCache
5. In a Warm Standby DR strategy, what state is the DR environment usually in?
- A) Fully shut down
- B) Running at reduced capacity
- C) Running at full capacity
- D) Only DNS configured
6. Which AWS service provides automatic failover for RDS across Availability Zones?
- A) Read Replicas
- B) Multi-AZ
- C) Global Database
- D) Automated backups
7. A company needs a DR solution with an RTO of less than 1 second and an RPO of near zero. Which strategy should they choose?
- A) Backup & Restore
- B) Pilot Light
- C) Warm Standby
- D) Multi-Site Active-Active
8. What is the failover time for Aurora in an AZ failure scenario?
- A) 60-120 seconds
- B) ~30 seconds
- C) ~10 seconds
- D) Instant
9. Which Route53 routing policy is used for active-passive failover?
- A) Simple
- B) Weighted
- C) Failover
- D) Latency
10. A company needs to replicate an RDS database to another region with less than 5 seconds of lag. Which option should they use?
- A) Cross-Region Snapshot Copy
- B) Cross-Region Read Replica
- C) RDS Multi-AZ
- D) Database Migration Service
11. Which DynamoDB feature enables multi-region active-active replication?
- A) DAX
- B) Global Tables
- C) DynamoDB Streams
- D) TTL
12. What is the main advantage of AWS Global Accelerator in a multi-region architecture?
- A) Caches static content at edge locations
- B) Routes traffic to the optimal endpoint via the AWS global network
- C) Provides DNS failover
- D) Decouples application components
13. In a Pilot Light DR strategy, which resources are typically running in the DR region?
- A) Full production environment
- B) Core data services (e.g., database replicas) with compute stopped
- C) Only S3 backups
- D) Nothing is running
14. A company wants to protect against accidental deletion of an S3 object. Which feature should they enable?
- A) Versioning
- B) Lifecycle policies
- C) Cross-region replication
- D) Transfer Acceleration
15. What is the purpose of an Elastic Load Balancer health check?
- A) To monitor CPU utilization
- B) To route traffic only to healthy instances
- C) To store logs
- D) To encrypt traffic
16. Which DR strategy replicates your application in a scaled-down state in the DR region?
- A) Backup & Restore
- B) Pilot Light
- C) Warm Standby
- D) Multi-Site
17. A company needs to automatically replace an unhealthy EC2 instance in an Auto Scaling group. What is this process called?
- A) Scaling out
- B) Health check replacement
- C) Self-healing
- D) Instance refresh
18. How many copies of data does Aurora store across 3 Availability Zones?
- A) 3
- B) 6
- C) 9
- D) 2
19. What is the primary purpose of an RDS read replica?
- A) High availability with automatic failover
- B) Offload read traffic from the primary database
- C) Disaster recovery with zero RPO
- D) Data encryption
20. A company needs a recovery solution where data can be restored from daily snapshots. What is their likely RPO?
- A) 1 hour
- B) 15 minutes
- C) 24 hours
- D) Near zero
21. Which AWS service provides DNS-level failover across multiple regions?
- A) CloudFront
- B) Route53
- C) Global Accelerator
- D) ALB
22. What is the minimum number of EC2 instances you should run in an Auto Scaling group for high availability?
- A) 1
- B) 2
- C) 3
- D) 4
23. A company’s RDS database fails in an AZ outage with Multi-AZ enabled. What happens during failover?
- A) A read replica is promoted to primary
- B) The standby instance in another AZ becomes the new primary
- C) A new instance is launched
- D) The database is restored from snapshot
24. What is the RPO of DynamoDB Global Tables?
- A) Under 1 second
- B) 5 seconds
- C) 1 minute
- D) 5 minutes
25. Which architecture supports both a Pilot Light and Warm Standby approach depending on the size of the DR footprint?
- A) Multi-AZ
- B) Multi-Region active-passive
- C) Multi-Region active-active
- D) Single AZ

📝 Answer Key

1. B — RPO defines the maximum acceptable data loss (how far back in time you can lose data).
2. A — Backup & Restore is the cheapest strategy but has the highest RTO/RPO (hours).
3. A — Aurora Global Database typically has under 1 second of replication lag.
4. C — DynamoDB automatically replicates data across 3 AZs.
5. B — Warm Standby runs at reduced capacity, ready to scale up.
6. B — RDS Multi-AZ provides automatic failover to a standby in a different AZ.
7. D — Multi-Site Active-Active provides near-zero RTO and RPO with full production in multiple regions.
8. B — Aurora typically fails over to a replica in ~30 seconds; its storage keeps 6 copies across 3 AZs.
9. C — Route53 Failover routing policy directs traffic to a primary resource with a secondary failover.
10. B — Cross-Region Read Replicas replicate asynchronously, typically with < 5 seconds of lag.
11. B — DynamoDB Global Tables replicate data across regions with multi-region active-active support.
12. B — Global Accelerator routes traffic to optimal endpoints over the AWS global network.
13. B — Pilot Light keeps core data services (like DB replicas) running but compute instances stopped.
14. A — S3 Versioning protects against accidental deletion by preserving previous object versions.
15. B — Health checks monitor instance health and route traffic only to healthy targets.
16. C — Warm Standby runs a scaled-down version of production in the DR region.
17. C — Self-healing: the Auto Scaling group automatically replaces instances that fail health checks.
18. B — Aurora stores 6 copies of data across 3 AZs for durability and availability.
19. B — Read replicas offload read traffic by providing additional read-only database endpoints.
20. C — Daily snapshots mean up to 24 hours of potential data loss (24-hour RPO).
21. B — Route53 provides DNS failover routing with health checks for multi-region failover.
22. B — At least 2 instances across 2 AZs ensure high availability in an Auto Scaling group.
23. B — Multi-AZ promotes the standby instance to primary with automatic DNS update.
24. A — DynamoDB Global Tables typically replicate data across regions in under 1 second.
25. B — Multi-Region active-passive supports both Pilot Light (minimal DR) and Warm Standby (scaled DR).
