Reliability

The ability of a system to recover from failures and to mitigate disruptions.

Design Principles

  • Test recovery procedures

  • Automatically recover from failure

  • Scale horizontally

  • Stop guessing capacity

  • Automate change

Best Practices

Foundations

Change Management

Failure Management

Disaster Recovery Strategy

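  • RTO (Recovery Time Objective) - Maximum acceptable time to restore service after a failure

  • RPO (Recovery Point Objective) - Maximum acceptable amount of data loss, measured as time back to the last recovery point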
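To make the two objectives concrete, here is a minimal sketch that checks one incident timeline against them. The class and function names, the dates, and the 4 hour / 1 hour targets are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # longest acceptable downtime
    rpo: timedelta  # longest acceptable window of lost data


def evaluate_recovery(objectives, last_recovery_point, failure_at, restored_at):
    """Compare one incident's timeline against the objectives."""
    downtime = restored_at - failure_at
    data_loss_window = failure_at - last_recovery_point
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss_window <= objectives.rpo,
    }


# Hypothetical incident: last backup at 02:00, failure at 02:45, service back at 05:15.
objectives = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))
report = evaluate_recovery(
    objectives,
    last_recovery_point=datetime(2024, 1, 1, 2, 0),
    failure_at=datetime(2024, 1, 1, 2, 45),
    restored_at=datetime(2024, 1, 1, 5, 15),
)
```

In this made-up incident the downtime is 2.5 hours and the data loss window is 45 minutes, so both objectives are met.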

Backup and Restore

  • Back up data to AWS or a second region (S3, snapshots)

  • Have AMIs in recovery region

  • CloudFormation templates standing by

  • In Case of Disaster

    • Spin up Instances from AMIs (use templates)

    • Restore backup data

    • Modify DNS to point to new instances

  • RTO - Time it takes to launch new instances, restore data, and update DNS

  • RPO - Data generated since the last backup
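The RTO and RPO for Backup and Restore can be estimated with simple arithmetic. The sketch below assumes the three recovery steps run one after another and that backup timestamps are sorted; the function names, step durations, and nightly backup schedule are placeholders, not measured values.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def backup_restore_rto(launch, restore, dns_update):
    """Recovery steps run one after another, so their durations add up."""
    return launch + restore + dns_update


def backup_restore_rpo(backup_times, failure_at):
    """Data written after the most recent backup before the failure is lost."""
    index = bisect_right(backup_times, failure_at)  # backup_times must be sorted
    if index == 0:
        raise ValueError("no backup exists before the failure")
    return failure_at - backup_times[index - 1]


# Placeholder durations and a hypothetical nightly backup at 01:00.
rto = backup_restore_rto(
    launch=timedelta(minutes=15),
    restore=timedelta(hours=2),
    dns_update=timedelta(minutes=5),
)
backups = [datetime(2024, 1, day, 1, 0) for day in (1, 2, 3)]
rpo = backup_restore_rpo(backups, failure_at=datetime(2024, 1, 3, 13, 30))
```

With a fixed backup schedule, the worst case RPO is the full interval between two backups, since a failure can happen just before the next backup runs.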

Pilot Light

  • Cross Region Replication

    • RDS, DynamoDB, S3

  • Instances stopped

  • Smaller DB instance

  • In Case of Disaster

    • Start instances

    • Scale up DB, Promote to Primary

    • Modify DNS or use Route53 failover

  • RTO - Time to start the instances and scale up the DB

  • RPO - replication lag only
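The "Route53 failover" step can be sketched as a pair of failover records, one PRIMARY and one SECONDARY. The dictionary below follows the ChangeBatch shape that boto3's `change_resource_record_sets` call accepts, as far as I know it; the helper names, domain name, IP addresses, set identifiers, and health check ID are hypothetical placeholders, and nothing here contacts AWS.

```python
def failover_record(name, ip_address, role, set_identifier, health_check_id=None, ttl=60):
    """Build one UPSERT change for a Route 53 failover A record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_identifier,
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": ip_address}],
    }
    if health_check_id is not None:
        # The health check decides when Route 53 stops answering with this record.
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name, primary_ip, secondary_ip, primary_health_check_id):
    """Pair a health-checked primary record with a secondary in the recovery region."""
    return {
        "Comment": "Failover pair for " + name,
        "Changes": [
            failover_record(name, primary_ip, "PRIMARY", "primary-region", primary_health_check_id),
            failover_record(name, secondary_ip, "SECONDARY", "recovery-region"),
        ],
    }


# Placeholder values only (documentation IP ranges, made-up health check ID).
batch = failover_change_batch(
    "app.example.com.", "192.0.2.10", "198.51.100.10", "hypothetical-health-check-id"
)
# With boto3 installed and credentials configured, the batch could be submitted as:
# boto3.client("route53").change_resource_record_sets(HostedZoneId="<zone id>", ChangeBatch=batch)
```

Compared with modifying DNS by hand, failover records remove a manual step from the recovery path, which is why they shorten the RTO.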

Low Capacity Standby

  • Cross region replication

  • Similar to Pilot Light

  • Some capacity running 24/7

  • Continuous testing with trickle traffic

  • Multi-Master Option (Aurora)

  • In Case of Disaster

    • Scale up/Autoscale to full production capacity

    • Route53 failover for DNS

  • RTO - time to scale

  • RPO - replication lag only
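Since the RTO here is the time to scale, a rough estimate only needs the standby size and the scale-out speed. This is a sketch under simplifying assumptions: instances launch in fixed-size batches and each batch takes the same time. The function names and every number below are hypothetical.

```python
import math


def standby_capacity(full_capacity, standby_fraction):
    """Instances kept running 24/7 in the recovery region (at least one)."""
    if not 0 < standby_fraction <= 1:
        raise ValueError("standby_fraction must be in (0, 1]")
    return max(1, math.ceil(full_capacity * standby_fraction))


def scale_up_plan(full_capacity, standby_fraction, launch_batch_size, batch_launch_minutes):
    """Estimate how many instances to add and how long scaling takes."""
    running = standby_capacity(full_capacity, standby_fraction)
    to_add = full_capacity - running
    batches = math.ceil(to_add / launch_batch_size)
    return {
        "running": running,
        "to_add": to_add,
        "estimated_minutes": batches * batch_launch_minutes,
    }


# Placeholder sizing: 20 instances in production, 25% kept warm in the recovery region.
plan = scale_up_plan(
    full_capacity=20, standby_fraction=0.25, launch_batch_size=5, batch_launch_minutes=4
)
```

Raising the standby fraction lowers the time to scale at the price of more capacity running 24/7.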

Multi-Site Active-Active

  • Cross region replication or Multi-Master

  • Full capacity running 24/7 in two regions

  • Multi-Master Option (Aurora)

  • In Case of Disaster

    • Route53 failover for DNS

  • RTO - time to fail over

  • RPO - replication lag only
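The four strategies trade running cost against RTO and RPO. A minimal sketch of choosing between them: given estimates listed in the order used in this section (least to most capacity running, assumed here to mean cheapest first), pick the first one that meets both targets. All RTO and RPO figures below are illustrative placeholders, not measurements or AWS guarantees.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class StrategyEstimate:
    name: str
    rto: timedelta
    rpo: timedelta


def cheapest_strategy(estimates, target_rto, target_rpo):
    """Return the first estimate meeting both targets; estimates are ordered cheapest first."""
    for estimate in estimates:
        if estimate.rto <= target_rto and estimate.rpo <= target_rpo:
            return estimate
    return None  # no strategy meets the targets


# Placeholder estimates; replace with figures from your own recovery tests.
estimates = [
    StrategyEstimate("Backup and Restore", timedelta(hours=8), timedelta(hours=24)),
    StrategyEstimate("Pilot Light", timedelta(hours=1), timedelta(minutes=5)),
    StrategyEstimate("Low Capacity Standby", timedelta(minutes=15), timedelta(minutes=5)),
    StrategyEstimate("Multi-Site Active-Active", timedelta(minutes=1), timedelta(minutes=5)),
]
choice = cheapest_strategy(
    estimates, target_rto=timedelta(minutes=30), target_rpo=timedelta(minutes=15)
)
```

With these placeholder figures, a 30 minute RTO target rules out the first two strategies and selects Low Capacity Standby.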
