Engineering a Single Point of Failure with AWS. Users Beware.

Much has been written about the fragility of Amazon Web Services’ US-East-1 region. Every time AWS has an outage, it seems to be US-East that brings the service down. US-East has the oldest infrastructure and this, in part, explains why it’s often the center of attention. Given that US-East is the least reliable of all of Amazon’s regions, one would have thought that AWS would move to limit the critical services running there. If a recent blog post by my friend Rene Buest is correct, AWS in fact builds many of its higher-level services on top of US-East, thus exacerbating the impact of problems. Let’s dig into what Buest uncovered in his post.

Buest’s concerns center on the apparent existence of a single point of failure underlying a number of AWS’s more sophisticated services. We need to circle back a little here – one of the key findings from historical AWS outages has been that designing for failure, in the way Netflix approaches its infrastructure, can give an organization protection against point failures. Admittedly a service can still go down if the entire breadth of the provider’s infrastructure fails, but this is far less likely than a point failure. Spreading the risk across multiple geographies and zones is a logical approach for users.
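To put rough numbers on that intuition, here is a minimal sketch in Python. It assumes zones fail independently of one another and uses a made-up figure of 99% availability per zone; neither the figure nor the independence assumption comes from AWS, and the function name is my own.

```python
def availability_across_zones(zone_availability: float, zones: int) -> float:
    """Chance that at least one zone is up, assuming independent failures.

    zone_availability is an illustrative number, not a published AWS figure.
    """
    if not 0.0 <= zone_availability <= 1.0:
        raise ValueError("zone_availability must be between 0 and 1")
    if zones < 1:
        raise ValueError("zones must be at least 1")
    # The service is down only when every zone is down at the same time.
    return 1.0 - (1.0 - zone_availability) ** zones


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} zone(s): {availability_across_zones(0.99, n):.6f}")
```

The point is only that each additional independent zone shrinks the chance of a total outage dramatically. The arithmetic stops holding the moment the zones are not truly independent.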

One would think, then, that AWS would itself adopt this approach and avoid having all its proverbial eggs in one basket. Yet many AWS services are built on top of Elastic Block Store (EBS): from Elastic Load Balancer (ELB) to Relational Database Service (RDS) and on to Elastic Beanstalk (EB), they all rely on EBS being available in order to function. Buest points to a post-mortem investigation of awe.sm’s performance during one AWS outage. Awe.sm moved away from EBS due to issues of poor I/O throughput, failures at a regional level and server failure modes on Ubuntu. As the company said:

For these reasons, and our strong focus on uptime, we abandoned EBS entirely, starting about six months ago, at some considerable cost in operational complexity (mostly around how we do backups and restores). So far, it has been absolutely worth it in terms of observed external uptime.

Unfortunately, the services that are built on top of EBS (ELB, RDS, EB) rely on EBS being up in order to function – if EBS fails and customers want to balance their traffic to another region to compensate, the balancing isn’t available (since it sits atop EBS). It’s a circular fault that removes the ability of customers to use the very services that are designed to help them reduce the impact of outages. Witness the status messages below from 25 August, where both RDS and ELB suffered issues due to an EBS performance degradation:

EC2 (N. Virginia)
[RESOLVED] Degraded performance for some EBS Volumes

ELB (N. Virginia)
[RESOLVED] Connectivity Issues

RDS (N. Virginia)
[RESOLVED] RDS connectivity issues in a single availability zone

The bottom line here is that upstream services, including ones which ironically are used to direct traffic elsewhere in the event of an outage, are reliant on EBS in order to function. AWS has effectively created a single point of failure with EBS – it’s more important than ever that AWS ensures EBS is reliable. I reached out to a former AWS engineer to fact-check the assertions – despite being, to this day, a fan of AWS, he was critical of an architecture that sees many AWS services rely on what is a fragile service:
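The shape of the problem can be sketched as a small dependency graph. The Python below is purely illustrative: the map encodes only the relationships described in this post (ELB, RDS and Elastic Beanstalk each relying on EBS), not AWS’s actual internal architecture, and the function names are my own.

```python
from typing import Dict, Iterable, Set

# Hypothetical dependency map, taken from the claims in this post only.
DEPENDS_ON: Dict[str, Set[str]] = {
    "ELB": {"EBS"},
    "RDS": {"EBS"},
    "Elastic Beanstalk": {"EBS"},
    "EBS": set(),
}


def transitive_dependencies(service: str, graph: Dict[str, Set[str]]) -> Set[str]:
    """Return everything a service relies on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(graph.get(service, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, ()))
    return seen


def shared_dependencies(services: Iterable[str], graph: Dict[str, Set[str]]) -> Set[str]:
    """Dependencies common to every listed service: candidate single points of failure."""
    dep_sets = [transitive_dependencies(s, graph) for s in services]
    return set.intersection(*dep_sets) if dep_sets else set()


def impacted_by(failed: str, graph: Dict[str, Set[str]]) -> Set[str]:
    """Services that stop working when the given dependency fails."""
    return {s for s in graph if failed in transitive_dependencies(s, graph)}


if __name__ == "__main__":
    print(shared_dependencies(["ELB", "RDS", "Elastic Beanstalk"], DEPENDS_ON))
    print(sorted(impacted_by("EBS", DEPENDS_ON)))
```

Any dependency that shows up in the shared set for the services you would lean on during an outage is exactly the kind of single point of failure described above.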

AWS EBS depends on a database that runs in ONE AZ. That AZ is the most problematic. When that AZ is unavail, EBS crashes.  This was true June 2012. Seems like it’s still true based on recent events.   As of June 2012, that AZ is connected to another AZ (doesn’t stand alone) and is more fragile than usual. And when EBS is down…yes, most everything is down. RDS, ELB, etc.

It’s a pretty sorry state of affairs and one which can’t possibly continue if AWS is to live up to its promise to become a credible enterprise vendor. I expect significant investment to reinforce the robustness of EBS, along with some clarity on short-term measures to remove the weaknesses.
