AWS: The Good, the Bad and the Ugly
Nobody realizes how many companies are using Amazon’s Elastic Compute Cloud (EC2) somewhere in their stack until it has an outage, and suddenly it seems like half the Internet goes away. But it’s not like Amazon got lucky: they have an awesome product. Everybody uses AWS because EC2 has radically simplified running software, by hugely lowering the amount you need to know about hardware in order to do so, and the amount of money you need to get started.
The first and most important thing to know about EC2 is that it is not merely a virtualized hosting service. A better way of thinking about it is as a fractional system and network administrator: instead of employing one very expensive person to do a whole lot of automation for you, you pay a little bit more for every box you own, and you have whole classes of problems abstracted away.
Power and network topology, hardware costs and vendor differences, network storage systems — these are things you used to have to think about back in 2004 (or regretted not thinking about). With AWS and its growing crop of competitors, you no longer do — or at least not until you get much bigger.
The most obvious cost advantage of AWS is that it has literally zero setup costs: you use the same Amazon account you use to order random junk off the Internet, click a button, and start paying for servers, by the hour. You only pay for the boxes when they run, and you only pay for storage that’s actually in use, so your startup costs are minimal, and it encourages experimentation at the hardware level: spin up 10x more capacity than you need, run your load tests, and then spin the extra boxes back down until you really need them. That’s not just convenient, that’s revolutionary.
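To make that concrete, here’s a minimal sketch of the burst-and-tear-down pattern using boto3, the Python SDK for AWS; the region, AMI ID and instance type below are placeholders, not recommendations:

```python
# Sketch: burst capacity for a load test, then tear it down.
# Region, AMI ID and instance type are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Spin up 10 throwaway instances for the load test.
reservation = ec2.run_instances(
    ImageId="ami-12345678",    # placeholder AMI
    InstanceType="m1.small",   # placeholder instance type
    MinCount=10,
    MaxCount=10,
)
instance_ids = [i["InstanceId"] for i in reservation["Instances"]]

# ... run the load test against these boxes ...

# Spin them back down; you stop paying when they terminate.
ec2.terminate_instances(InstanceIds=instance_ids)
```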
While we love EC2 and couldn’t have got where we are without it, it’s important to be honest that not everything is sunshine and roses. EC2 has serious performance and reliability limitations that you need to be aware of and build into your planning.
EC2 is divided into geographic regions, each of which is split into several availability zones, and there are a few important things we’ve learned about this region-zone pattern:
- Virtual hardware doesn’t last as long as real hardware. Our average observed lifetime for a virtual machine on EC2 over the last 3 years has been about 200 days. After that, the chances of it being “retired” rise hugely. And Amazon’s “retirement” process is unpredictable: sometimes they’ll notify you ten days in advance that a box is going to be shut down; sometimes the retirement notification email arrives 2 hours after the box has already failed.
- You need to be in more than one zone, and redundant across zones. It’s been our experience that you are more likely to lose an entire zone than to lose an individual box. So when you’re planning failure scenarios, having a master and a slave in the same zone is as useless as having no slave at all — if you’ve lost the master, it’s probably because that zone is unavailable. (There’s a short sketch of zone-pinned placement after this list.)
- Multi-zone failures happen, so if you can afford it, go multi-region too. US-east, the most popular (because oldest and cheapest) AWS region, had region-wide failures in June 2012, in March 2012, and most spectacularly in April 2011, which was nicknamed the cloudpocalypse. Our take on this — and we’re probably making no friends at AWS by saying so — is that AWS region-wide instability seems to frequently have the same root cause: EBS (more on that below).
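As promised in the list above, here’s a minimal sketch of zone-aware placement using boto3, the Python SDK for AWS; the AMI, instance type and zone names are placeholders:

```python
# Sketch: place a master and its slave in different availability zones,
# so losing one zone doesn't take out both. IDs and types are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def launch_in_zone(zone):
    """Launch one instance pinned to a specific availability zone."""
    resp = ec2.run_instances(
        ImageId="ami-12345678",    # placeholder AMI
        InstanceType="m1.large",   # placeholder instance type
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )
    return resp["Instances"][0]["InstanceId"]

master_id = launch_in_zone("us-east-1a")
slave_id = launch_in_zone("us-east-1b")   # a different zone, deliberately
```

The point is that the zone is an explicit placement decision you make at launch time, not something to leave to the scheduler.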
Elastic Block Store (EBS) is fundamental to the way AWS expects you to use EC2: it wants you to host all your data on EBS volumes, and when instances fail, you can switch the EBS volume over to the new hardware, in no time and with no fuss. It wants you to use EBS snapshots for database backup and restoration. It wants you to host the operating system itself on EBS, known as “EBS-backed instances”.
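The “switch the volume over” workflow is essentially a detach followed by a re-attach. Here’s a rough sketch using boto3, the Python SDK for AWS, with hypothetical volume and instance IDs:

```python
# Sketch: move an EBS volume from a failed instance to a replacement.
# The volume and instance IDs are hypothetical.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

VOLUME_ID = "vol-0123456789abcdef0"
REPLACEMENT_INSTANCE_ID = "i-0fedcba9876543210"

# Detach from the dead box (Force=True, since it can't respond).
ec2.detach_volume(VolumeId=VOLUME_ID, Force=True)
ec2.get_waiter("volume_available").wait(VolumeIds=[VOLUME_ID])

# Attach to the replacement and carry on where you left off.
ec2.attach_volume(
    VolumeId=VOLUME_ID,
    InstanceId=REPLACEMENT_INSTANCE_ID,
    Device="/dev/sdf",   # device name as seen by the new instance
)
```

One caveat this glosses over: an EBS volume can only be attached to an instance in the same availability zone, which matters given the zone-level failures described above.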
In our admittedly anecdotal experience, EBS presented us with several major challenges:
I/O rates on EBS volumes are poor. I/O rates on virtualized hardware will necessarily suck relative to bare metal, but in our experience EBS has been significantly worse than local drives on the virtual host (what Amazon calls “ephemeral storage”). EBS volumes are essentially network drives, and have all the performance you would expect of a network drive — i.e. not great.
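If you’d rather measure this than take our word for it, a crude comparison is easy to run on a single instance: time sequential writes to an EBS-backed path and an ephemeral-backed path. The mount points below are assumptions; substitute whatever you actually have mounted:

```python
# Crude sketch: compare sequential write throughput on two mount points.
# /mnt/ebs and /mnt/ephemeral are assumed paths; adjust to your setup.
import os
import time

CHUNK = b"x" * (1024 * 1024)   # 1 MiB
TOTAL_MB = 512

def write_throughput(path):
    """Write TOTAL_MB of data to `path`, fsync it, and return MB/s."""
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(TOTAL_MB):
            f.write(CHUNK)
        f.flush()
        os.fsync(f.fileno())   # make sure it actually hit the device
    elapsed = time.time() - start
    os.remove(path)
    return TOTAL_MB / elapsed

print("EBS:       %.1f MB/s" % write_throughput("/mnt/ebs/testfile"))
print("Ephemeral: %.1f MB/s" % write_throughput("/mnt/ephemeral/testfile"))
```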
EBS fails at the region level, not on a per-volume basis. In our experience, EBS has had two modes of behaviour: all volumes operational, or all volumes unavailable. Of the three region-wide EC2 failures in us-east that I mentioned earlier, two were related to EBS issues cascading out of one zone into the others. If your disaster recovery plan relies on moving EBS volumes around, but the downtime is due to an EBS failure, you’ll be hosed.
The failure mode of EBS on Ubuntu is extremely severe: because EBS volumes are network drives masquerading as block devices, they break abstractions in the Linux operating system. This has led to really terrible failure scenarios for us, where a failing EBS volume causes an entire box to lock up, leaving it inaccessible and affecting even operations that have no direct need for disk activity.
Because some AWS value-added services are built on EBS, they fail when EBS fails. This is true of Elastic Load Balancer (ELB), Relational Database Service (RDS), Elastic Beanstalk and others. And EBS — in our experience — seems to nearly always lie at the core of major failures at Amazon. So if EBS fails and you suddenly need to shift traffic to another region, you can’t — because your load balancer also runs on EBS.