Reliability
The ability to recover from failure and mitigate disruptions.
Test recovery procedures
Automatically recover from failure
Scale horizontally
Stop guessing capacity
Automate change
Access Control
IAM
Isolated Networks
Service Limits
DDoS Protection
Shield
Control Access
Configuration Awareness
Audit AWS APIs
Demand Management
Auto Scaling
Infrastructure as Code
Durable Backups
Durable Archives
Glacier
Reliable Key Management
AWS KMS
RTO (Recovery Time Objective) - How long to recover
RPO (Recovery Point Objective) - How much data is lost
Backup and Restore
Back up data to AWS or a second region (S3, snapshots)
Have AMIs in recovery region
CloudFormation templates standing by
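As one possible illustration of a template kept on standby, the sketch below builds a minimal CloudFormation template in Python and prints it as JSON. The parameter name AmiId, the logical ID AppServer, and the default instance type are placeholders chosen for this example, not values from these notes.

```python
import json


def build_recovery_template(instance_type="t3.micro"):
    """Return a minimal CloudFormation template as a dict.

    The AMI ID is a template parameter so the same template can be
    launched with whichever AMI was copied to the recovery region.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Hypothetical recovery stack: one instance from a pre-copied AMI",
        "Parameters": {
            # Placeholder parameter name; supplied when the stack is created.
            "AmiId": {"Type": "AWS::EC2::Image::Id"},
        },
        "Resources": {
            # Placeholder logical ID for the recovered application server.
            "AppServer": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": {"Ref": "AmiId"},
                    "InstanceType": instance_type,
                },
            },
        },
    }


template_json = json.dumps(build_recovery_template(), indent=2)
print(template_json)
```

Keeping the AMI as a parameter rather than hard-coding it means the template does not need editing each time a newer AMI is copied to the recovery region.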
In Case of Disaster
Spin up Instances from AMIs (use templates)
Restore backup data
Modify DNS to point to new instances
RTO - Time it takes to launch new instances, restore data, update DNS
RPO - Data generated since last backup
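The RTO and RPO for this backup-based recovery can be estimated with simple arithmetic. The sketch below assumes the recovery steps run one after another. All durations and timestamps are made-up numbers for illustration, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def estimate_rpo(last_backup, failure_time):
    """RPO exposure: everything written between the last backup and the failure."""
    if failure_time < last_backup:
        raise ValueError("failure_time precedes last_backup")
    return failure_time - last_backup


def estimate_rto(launch, restore, dns_update):
    """RTO: launch instances, restore data, update DNS, assumed sequential."""
    return launch + restore + dns_update


# Hypothetical numbers, for illustration only.
last_backup = datetime(2024, 1, 1, 2, 0)     # nightly backup at 02:00
failure_time = datetime(2024, 1, 1, 14, 30)  # outage at 14:30

rpo = estimate_rpo(last_backup, failure_time)
rto = estimate_rto(
    launch=timedelta(minutes=15),
    restore=timedelta(hours=2),
    dns_update=timedelta(minutes=5),
)

print(f"RPO exposure: {rpo}")   # 12:30:00
print(f"Estimated RTO: {rto}")  # 2:20:00
```

With a nightly backup the worst-case RPO approaches 24 hours, which is why the replication-based approaches in these notes quote RPO as replication lag only.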
Pilot Light
Cross region replication
RDS, DynamoDB, S3
Instances stopped
Smaller DB instance
In Case of Disaster
Start instances
Scale up the DB and promote it to primary
Modify DNS or use Route 53 failover
RTO - Time to start the instances and scale up the DB
RPO - replication lag only
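The Route 53 failover option can be sketched as a pair of failover records, one PRIMARY with a health check and one SECONDARY pointing at the recovery region. The snippet below only builds the change batch as a plain dict. The domain name, IP addresses (documentation ranges), set identifiers, and health check ID are all placeholders.

```python
import json


def failover_record(name, ip, role, set_id, health_check_id=None, ttl=60):
    """Build one UPSERT change for a Route 53 failover record set."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_id,  # distinguishes records sharing a name and type
        "Failover": role,
        "TTL": ttl,               # a short TTL helps clients pick up the failover sooner
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name, primary_ip, secondary_ip, primary_health_check_id):
    """Pair a health-checked PRIMARY record with a SECONDARY in the recovery region."""
    return {
        "Comment": "Failover pair (illustrative placeholder values)",
        "Changes": [
            failover_record(name, primary_ip, "PRIMARY", "primary-region",
                            primary_health_check_id),
            failover_record(name, secondary_ip, "SECONDARY", "recovery-region"),
        ],
    }


batch = failover_change_batch(
    "app.example.com.", "192.0.2.10", "198.51.100.20", "placeholder-health-check-id"
)
print(json.dumps(batch, indent=2))
```

A dict of this shape is what the boto3 Route 53 client accepts as the ChangeBatch argument of change_resource_record_sets, alongside a HostedZoneId. The call itself is left out here so the sketch runs without AWS credentials.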
Warm Standby
Cross region replication
Similar to Pilot Light
Some capacity running 24/7
Continuous testing with a trickle of traffic
Multi-Master Option (Aurora)
In Case of Disaster
Scale up/Autoscale to full production capacity
Route 53 failover for DNS
RTO - time to scale
RPO - replication lag only
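These notes do not say how the small share of test traffic reaches the standby region. One common way, assumed here, is Route 53 weighted routing, where each record receives traffic in proportion to its weight divided by the sum of all weights. The sketch below just does that arithmetic; the region labels and weights are hypothetical.

```python
def traffic_shares(weights):
    """Return each record's expected share of traffic under weighted routing.

    Share = record weight / sum of all weights for the same name and type.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {label: weight / total for label, weight in weights.items()}


# Hypothetical split: most traffic to the primary, a trickle to the standby.
shares = traffic_shares({"primary-region": 95, "standby-region": 5})
for label, share in shares.items():
    print(f"{label}: {share:.0%}")
```

Sending even a small, steady share of real requests to the standby gives continuous evidence that it can serve traffic before a failover depends on it.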
Multi-Site
Cross region replication or Multi-Master
Full capacity running 24/7 in two regions
Multi-Master Option (Aurora)
In Case of Disaster
Route 53 failover for DNS
RTO - time to fail over
RPO - replication lag only
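The four patterns above trade running cost against recovery speed, so a rough way to choose between them is to take the cheapest one whose estimated RTO and RPO meet the targets. The sketch below assumes the list is ordered from lowest to highest running cost and uses the common AWS names for the patterns. Every duration is an invented placeholder, not a measured or documented value.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Strategy:
    name: str
    rto: timedelta  # caller's own estimate of time to recover
    rpo: timedelta  # caller's own estimate of data-loss window


def cheapest_meeting_targets(strategies, rto_target, rpo_target):
    """Return the first strategy meeting both targets, or None.

    `strategies` must be ordered from lowest to highest running cost.
    """
    for strategy in strategies:
        if strategy.rto <= rto_target and strategy.rpo <= rpo_target:
            return strategy
    return None


# Hypothetical estimates, ordered cheapest first.
options = [
    Strategy("Backup and Restore", rto=timedelta(hours=4), rpo=timedelta(hours=24)),
    Strategy("Pilot Light", rto=timedelta(minutes=30), rpo=timedelta(minutes=1)),
    Strategy("Warm Standby", rto=timedelta(minutes=10), rpo=timedelta(minutes=1)),
    Strategy("Multi-Site", rto=timedelta(minutes=1), rpo=timedelta(minutes=1)),
]

choice = cheapest_meeting_targets(
    options, rto_target=timedelta(minutes=15), rpo_target=timedelta(minutes=5)
)
print(choice.name if choice else "No strategy meets the targets")  # Warm Standby
```

In practice the estimates would come from actually testing the recovery procedure, which is the first design principle listed at the top of these notes.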