
Reliability

The ability to recover from failure and mitigate disruptions.

Design Principles

Test recovery procedures
Automatically recover from failure
Scale horizontally
Stop guessing capacity
Automate change

Best Practices

Foundations

Access Control
- IAM
Isolated Networks
- VPC
Service Limits
- Trusted Advisor
DDoS Protection
- Shield

Change Management

Monitoring and Alarms
- CloudWatch
Configuration Awareness
- AWS Config
Audit AWS APIs
- CloudTrail
Demand Management
- Auto Scaling

Failure Management

Infrastructure as Code
- CloudFormation
Durable Backups
- Simple Storage Service (S3)
Durable Archives
- Glacier
Reliable Key Management
- AWS KMS

Disaster Recovery Strategy

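RTO (Recovery Time Objective) - Maximum acceptable time to restore service after a failure
RPO (Recovery Point Objective) - Maximum acceptable data loss, measured as the time since the last recoverable copy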
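
As a rough illustration of how the two objectives are checked against a real incident, here is a minimal sketch. The function names and the timestamps are hypothetical and chosen only for this example.

```python
from datetime import datetime, timedelta


def measure_recovery(last_backup, failure, restored):
    """Return (downtime, data_loss_window) for one incident."""
    downtime = restored - failure              # compared against the RTO
    data_loss_window = failure - last_backup   # compared against the RPO
    return downtime, data_loss_window


def meets_objectives(downtime, data_loss_window, rto, rpo):
    """True only if both the RTO and the RPO were met."""
    return downtime <= rto and data_loss_window <= rpo


# Hypothetical incident: backup at 02:00, failure at 09:30, service back at 11:00.
downtime, loss = measure_recovery(
    last_backup=datetime(2024, 1, 1, 2, 0),
    failure=datetime(2024, 1, 1, 9, 30),
    restored=datetime(2024, 1, 1, 11, 0),
)
ok = meets_objectives(downtime, loss, rto=timedelta(hours=2), rpo=timedelta(hours=4))
```

In this made-up incident the service is back within the 2 hour RTO, but 7.5 hours of data falls outside the 4 hour RPO, so the objectives are not met.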

Backup and Restore

Back up data to AWS or a second region (S3, snapshots)
Have AMIs in recovery region
CloudFormation templates standing by
In Case of Disaster
- Launch instances from AMIs (use the CloudFormation templates)
- Restore backup data
- Modify DNS to point to new instances
RTO - Time to launch new instances, restore data, and update DNS
RPO - Data generated since last backup
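
The RTO and RPO described above can be estimated with simple arithmetic. This is a sketch that assumes the recovery steps run one after another; the durations are hypothetical placeholders, not measurements.

```python
from datetime import timedelta


def backup_restore_rto(launch, restore, dns_update):
    """Estimate RTO: the steps run sequentially, so their durations add up."""
    return launch + restore + dns_update


def worst_case_rpo(backup_interval):
    """Worst case: the failure hits just before the next backup would have run."""
    return backup_interval


# Hypothetical durations for illustration only.
rto_estimate = backup_restore_rto(
    launch=timedelta(minutes=10),
    restore=timedelta(minutes=45),
    dns_update=timedelta(minutes=5),   # should include the record's TTL
)
rpo_estimate = worst_case_rpo(timedelta(hours=24))   # nightly backups
```

With these numbers the estimated RTO is one hour and the worst-case RPO is a full day, which is why this strategy suits workloads that tolerate long recovery times.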

Pilot Light

Cross-region replication
- RDS, DynamoDB, S3
Instances stopped
Smaller DB instance
In Case of Disaster
- Start instances
- Scale up the DB and promote the replica to primary
- Modify DNS or use Route53 failover
RTO - Time to start instances and scale up
RPO - Replication lag only
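
The Route53 failover option can be sketched as a pair of failover records. The snippet below only builds the change batch as a plain dictionary, in the shape accepted by boto3's `change_resource_record_sets`. The record name, IP addresses, and health check ID are hypothetical.

```python
def failover_change_batch(name, primary_ip, secondary_ip, health_check_id, ttl=60):
    """Build a Route 53 change batch with PRIMARY and SECONDARY failover records."""

    def record(role, ip):
        rrset = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": f"{name}-{role.lower()}",
            "Failover": role,                 # "PRIMARY" or "SECONDARY"
            "TTL": ttl,                       # a low TTL shortens the DNS part of RTO
            "ResourceRecords": [{"Value": ip}],
        }
        if role == "PRIMARY":
            # Route 53 fails over when this health check reports unhealthy.
            rrset["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "Failover pair for disaster recovery",
        "Changes": [record("PRIMARY", primary_ip), record("SECONDARY", secondary_ip)],
    }


# Hypothetical values (documentation IP ranges, made-up health check ID).
batch = failover_change_batch(
    "app.example.com.", "203.0.113.10", "198.51.100.20", "hypothetical-health-check-id"
)
# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your zone id>", ChangeBatch=batch)
```

Building the batch separately from the API call keeps the record definitions easy to review and test before anything touches DNS.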

Low Capacity Standby

Cross-region replication
Similar to Pilot Light
Some capacity running 24/7
Continuous testing with trickle traffic
Multi-Master Option (Aurora)
In Case of Disaster
- Scale up/Autoscale to full production capacity
- Route53 failover for DNS
RTO - Time to scale
RPO - Replication lag only
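
Scaling the standby to full production capacity comes down to knowing how many instances are missing. A minimal sketch, assuming capacity is sized by requests per second; the traffic figures and per-instance throughput are hypothetical.

```python
import math


def instances_to_add(peak_rps, rps_per_instance, running):
    """Return how many instances the standby region still needs for peak load."""
    required = math.ceil(peak_rps / rps_per_instance)
    return max(0, required - running)


# Hypothetical sizing: 9000 requests/s at peak, 500 requests/s per instance,
# 4 instances already running in the standby region.
missing = instances_to_add(peak_rps=9000, rps_per_instance=500, running=4)
```

In practice an Auto Scaling group would make this adjustment on its own. The calculation is still useful for estimating how long the scale-up portion of the RTO might take.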

Multi-Site Active-Active

Cross-region replication or Multi-Master
Full capacity running 24/7 in two regions
Multi-Master Option (Aurora)
In Case of Disaster
- Route53 failover for DNS
RTO - Time to fail over
RPO - Replication lag only
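
The four strategies trade standing capacity (and so cost) for faster recovery. One way to choose between them is to pick the first one, in order of standing capacity, whose estimated RTO and RPO meet the targets. This is a sketch: the estimates below are hypothetical and would need to come from your own recovery tests.

```python
from datetime import timedelta

# Ordered from least to most capacity kept running, as described above.
STRATEGIES = [
    "Backup and Restore",
    "Pilot Light",
    "Low Capacity Standby",
    "Multi-Site Active-Active",
]


def pick_strategy(estimates, rto_target, rpo_target):
    """Return the first strategy whose (RTO, RPO) estimate meets both targets, else None."""
    for name in STRATEGIES:
        rto, rpo = estimates[name]
        if rto <= rto_target and rpo <= rpo_target:
            return name
    return None


# Hypothetical (RTO, RPO) estimates for illustration only.
ESTIMATES = {
    "Backup and Restore": (timedelta(hours=4), timedelta(hours=24)),
    "Pilot Light": (timedelta(minutes=30), timedelta(minutes=1)),
    "Low Capacity Standby": (timedelta(minutes=10), timedelta(minutes=1)),
    "Multi-Site Active-Active": (timedelta(minutes=1), timedelta(minutes=1)),
}

choice = pick_strategy(ESTIMATES, rto_target=timedelta(hours=1), rpo_target=timedelta(minutes=5))
```

With these assumed estimates, a 1 hour RTO and 5 minute RPO select Pilot Light. Tightening the RTO to 15 minutes would move the choice to Low Capacity Standby.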


