How Should We Think About The AWS Outage?


Tuesday, February 28th, was a bad day for AWS and for AWS users who relied on the US-East-1 Region in Northern Virginia to run their business and/or to serve their customers. I won’t rehash what happened, but readers can get details on the outage in this TechTarget article. You can also find the postmortem from Amazon here. And yes, I know Amazon did not officially classify it as an outage, but it was effectively one for many users. As expected, both apologists and detractors have taken to social media and the Internet to defend or to bury Amazon Web Services. I’ve read everything from “If a user’s application failed, it’s all their fault” to “AWS is so unreliable even Amazon doesn’t use it” and everything in between. One of the most balanced reflections was actually written by my Rackspace colleague, Kevin Jackson.

I want to take a few moments to share some thoughts on the S3 outage/slow-down and what it means for users. Then I’ll walk through some tips for architecting against Region-level failures.

Some random but hopefully relevant thoughts on the outage:

  • Infrastructure is hard, and infrastructure at scale is orders of magnitude harder. Managing many thousands of servers is not the same as managing hundreds of servers with a more powerful script.
  • Durability is not the same as availability.  I saw many people mixing up the two yesterday.
    • S3 durability speaks to the ability of the service to protect data and to ensure they are not damaged or lost.
    • S3 availability speaks to the ability to actually access the service in order to get to your data.
    • As far as we know, no data was lost, which means AWS can continue to tout S3’s eleven 9s of durability. What did take a hit was their claim of four 9s of availability.
  • As expected, many vendors are coming out of the woodwork to talk about how much safer it would be to use their on-premises or private cloud software or hardware.
    • It is true that the blast radius of an AWS outage is much greater than for a private environment given the number of companies that run their businesses on AWS.
    • But anyone who runs on-premises infrastructure of any scale and tells you they’ve not had any significant outage is probably lying.
    • If a vendor promises you that their on-premises solution is bulletproof and always available, they are not telling you the truth.
    • On-premises solutions do fail but you don’t hear about it because the knowledge is not exposed to the public. Vendors have a vested interest in hiding any flaws in their solution and customers have a vested interest in not taking the chance that they may be criticized for their choice of solutions.
  • Private clouds have their place, but no private cloud solution provides, or will ever provide, the same level of services and innovation that public clouds provide. This means that while you may be able to achieve higher availability by staying on premises, the tradeoff is that you will not be able to take advantage of the innovations of the Public Cloud. This tradeoff manifests itself as opportunity cost.
  • Companies like Amazon and Netflix were not impacted by the issues in the US-East-1 Region because they were architected to survive even the failure of a Region.
  • While there is a cost to every solution, the building blocks exist to create a highly available environment that can survive the failure of a Region.
  • Too many users still do not leverage multiple Availability Zones, let alone multiple Regions for their applications.
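
To put the "9s" above in concrete terms, here is a back-of-the-envelope sketch in Python. The function names are my own, and the four-hour disruption is a hypothetical figure used only to show the arithmetic, not a number taken from Amazon's postmortem.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(nines: int) -> float:
    """Downtime per year allowed by an availability target of `nines` 9s."""
    return MINUTES_PER_YEAR * 10 ** -nines


def achieved_availability(outage_minutes: float) -> float:
    """Availability over one year if the service was unreachable for `outage_minutes`."""
    return 1 - outage_minutes / MINUTES_PER_YEAR


def expected_annual_object_loss(object_count: int, nines: int = 11) -> float:
    """Objects you would expect to lose per year at a durability of `nines` 9s."""
    return object_count * 10 ** -nines


if __name__ == "__main__":
    # Availability: four 9s leaves less than an hour of downtime per year.
    print(f"Four 9s budget: {downtime_budget_minutes(4):.2f} minutes/year")

    # Hypothetical four-hour disruption (assumed figure, for illustration only).
    print(f"Availability after 240 min down: {achieved_availability(240):.4%}")

    # Durability is a separate promise: it is about losing data, not reaching it.
    print(f"Expected loss with 10M objects: {expected_annual_object_loss(10_000_000)} objects/year")
```

Four 9s of availability works out to a budget of under an hour a year, so a single disruption lasting a few hours uses up several years' worth. Durability is measured on a different axis entirely, which is why one number can take a hit while the other stays intact.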

Here are some tips on building highly available multi-region applications and on setting up Disaster Recovery in the Cloud. They are not meant to be comprehensive or detailed but to get folks thinking differently. You may want to read my primer on Regions and Availability Zones first if you need a refresher.

  • Sound principles for building resilient systems don’t just go away because you are in a public cloud. You have to adapt those principles to work with a new architecture and with a new set of tools and services.
  • Understand the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for each of your applications so you can design the right solution for each use case.
  • Leverage multiple regions for your applications whenever possible. Below is an architecture diagram, courtesy of Pinterest, that illustrates what a multi-region deployment might look like.

[Architecture diagram: Pinterest multi-region deployment (pinterest-ha)]

  • There is no one-size-fits-all solution for utilizing multiple AWS Regions. There are different approaches you can take depending on your RTO, your RPO, the cost you are willing and able to incur, and the tradeoffs you are willing to make. Some of these approaches include:
    • Recovering to another Region from backups – Back up your environment to S3, including EBS snapshots, RDS snapshots, AMIs and regular file backups. Since S3 only replicates data, by default, to Availability Zones within a single Region, you’ll need to enable cross-region replication to your DR Region. You’ll incur the cost of transferring and storing data in a second Region but won’t incur compute, EBS or database costs until you need to go live in your DR Region. The trade-off is the time required to launch your applications.
    • Warm standby in another Region – Replicate data to a second Region where you’ll run a scaled down version of your production environment. The scaled down environment is always live and sized to run the minimal capacity needed to resume business. Use Route 53 to switch over to your DR Region as needed. Scale up the environment to full capacity as needed. You get faster recovery but incur higher costs.
    • Hot standby in another Region – Replicate data to a second Region where you run a full version of your production environment. The environment is always live and invoking full DR involves switching traffic over using Route 53. You get even faster recovery but incur even higher costs.
    • Multi-Region active/active solution – Data is synchronized between both Regions and both Regions are used to service requests. This is the most complex approach to set up and the most expensive to run. However, little or no downtime is suffered even when an entire Region fails. While the approaches above are really DR solutions, this is about building a true highly available solution.
  • One of the keys to a successful multi-region setup and DR process is to automate as much as possible. This includes backups, replication and launching your applications. Leverage tools such as CloudFormation to capture the state of your environment and to automate the launching of resources. This is particularly important if you plan to recover from backups and to create your DR environment from scratch when you invoke DR.
  • Test, test and test again to ensure that you are able to successfully recover from an Availability Zone or Region failure. Test not only your tools but your processes.
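
One way to make the RTO/RPO discussion above concrete is to write the tradeoff down as data and pick the cheapest approach that still meets your targets. The Python sketch below does that. Everything in it is an assumption for illustration: the class and function names are mine, and the RTO, RPO and cost figures are placeholders, not AWS numbers. Replace them with what you actually measure in your own DR tests.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DrApproach:
    name: str
    rto_minutes: float  # assumed time needed to resume service
    rpo_minutes: float  # assumed window of data that may be lost
    relative_cost: int  # 1 = cheapest to run, 4 = most expensive


# Placeholder figures only; they mirror the ordering described above
# (faster recovery costs more), not any published AWS numbers.
APPROACHES = [
    DrApproach("backup-and-restore", rto_minutes=24 * 60, rpo_minutes=24 * 60, relative_cost=1),
    DrApproach("warm-standby", rto_minutes=60, rpo_minutes=15, relative_cost=2),
    DrApproach("hot-standby", rto_minutes=10, rpo_minutes=5, relative_cost=3),
    DrApproach("active-active", rto_minutes=1, rpo_minutes=1, relative_cost=4),
]


def cheapest_approach(rto_target: float, rpo_target: float) -> Optional[DrApproach]:
    """Return the lowest-cost approach meeting both targets, or None if none does."""
    candidates = [
        a for a in APPROACHES
        if a.rto_minutes <= rto_target and a.rpo_minutes <= rpo_target
    ]
    return min(candidates, key=lambda a: a.relative_cost, default=None)


if __name__ == "__main__":
    # An internal reporting tool that can be down for two days.
    print(cheapest_approach(rto_target=48 * 60, rpo_target=48 * 60))
    # A customer-facing application that must be back within 30 minutes.
    print(cheapest_approach(rto_target=30, rpo_target=30))
```

The point is not the specific thresholds but the habit: decide the RTO and RPO per application first, then let those targets, rather than fear of the next outage, determine how much you spend on a second Region.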

Obviously, much more can be said here and I plan to provide more details in the future about how to architect a multi-region application and to set up DR in the Cloud. Meanwhile, I hope that I’ve provided some food for thought on how we should think about Public Cloud and AWS outages.