Ruku, on 01 July 2012 - 01:38 PM, said:
Just to be clear, my concerns were more angled towards healthcare providers hosting mission critical (as in life or death!) operations in single AZs. Hearing about all the foolish companies placing their customers' lives at risk purely out of selfishness and/or an insane lack of foresight was chilling.
Oh my, agreed. And screw multiple AZs -- if you're running literal life-or-death services and you're not in multiple regions, then your company should not exist.
I have positively no idea how Amazon's implementation works (I'm still not convinced that EC2 is cost-effective enough to migrate to), but surely you can just configure multiple A records on the domains which point to different load balancers? It's not particularly sophisticated or elegant, but it's a primitive way to ensure connectivity when a load balancer fails. I still see every individual IP address as a potential disaster zone.
That's just it -- that's super easy and even fairly cheap for things like webservers, especially when your failover endpoints use auto-scaling so you only need to pay for very minimal instances unless they're needed. But EC2 wasn't even our problem this time. Being stretched across multiple AZs, each of our services was durable enough to deal with it. I could have had us back online in half the time if it weren't for the fact that RDS (managed MySQL, ironically set up with failover) failed.
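For anyone following along, here's a rough sketch of what the multiple-A-record approach looks like from the client side: resolve every address behind a name and use the first one that answers. This is only an illustration, not how Amazon or our own setup does it. The helper names (`resolve_a_records`, `tcp_probe`, `first_reachable`), the port, and the timeout are all made up for the example.

```python
import socket
from typing import Callable, Iterable, List, Optional


def resolve_a_records(hostname: str, port: int = 80) -> List[str]:
    """Return every IPv4 address the resolver hands back for hostname."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    addresses: List[str] = []
    for info in infos:
        ip = info[4][0]  # sockaddr is (ip, port) for AF_INET
        if ip not in addresses:  # keep resolver order, drop duplicates
            addresses.append(ip)
    return addresses


def tcp_probe(ip: str, port: int = 80, timeout: float = 2.0) -> bool:
    """Hypothetical health check: can we open a TCP connection at all?"""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable(
    addresses: Iterable[str],
    probe: Callable[[str], bool] = tcp_probe,
) -> Optional[str]:
    """Walk the address list in order and return the first one that responds."""
    for ip in addresses:
        if probe(ip):
            return ip
    return None  # every load balancer behind the name is down
```

Something like `first_reachable(resolve_a_records("www.example.com"))` would be the typical call. Browsers do a version of this on their own, which is why the trick works at all, but how quickly they give up on a dead address varies, so it's a crude safety net rather than real failover.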
Well that really sucks, and it may not exactly be something anyone can change, at least from a technical standpoint. The only way they could possibly avoid issues like this in the future would be to preempt emergency situations and make the changes before an outage actually occurs, which would give their provisioning systems the necessary time to make the changes. Of course, they could always provide reliable UPS and generator systems (like every other damn datacentre) to fail over onto when events like these occur, but I don't think we should really place blame until they've released their analysis.
Agreed agreed agreed. I hope they release their postmortem.
I guess it's time to consider multiple hosting providers as well as availability zones? This is a huge argument for interoperability between cloud hosting services, and a phenomenal part of the reason our company would rather go about building its "cloud" on the likes of OpenStack and Eucalyptus than some proprietary platform spearheaded by such a huge private company. I love the premise, but their sheer scale of operations is somewhat intimidating :s
Really, we could stay 100% with Amazon and be fine. There are availability zones, which are physically separate datacenters within a region, and then the regions, of which there are 8 -- and 3 of those are in the US. So we could go multi-region and be fine. The issue is that, while Amazon has amazing support for spanning their services across AZs, there is absolutely zero support for spanning them between regions. Obviously for webservers this is a non-issue, as they're not interconnected and there's a million ways to scale them. But I'd have to stop using DynamoDB and RDS and SQS and SimpleDB and others, because awesome as they are, there's no multi-region sync on them. To take a Dynamo-based DB multi-region, I'd be spending $20k on Riak and that's BEFORE server costs and data costs (since you pay WAN rates between regions). Taking MySQL multi-region is just a shit show, and I'd probably have to run MySQL Cluster somewhere. That means extra servers, an admin workload that I'd be looking at either hiring or outsourcing for, and even then failure recovery is a joke. I'd be better off looking at something like Oracle for our relational needs, which is more software cost. SQS is something we'd probably have to build in-house, but even then we'd be running 3-4 more servers for durability and eating the associated cost.
What it boils down to is, at a certain point it's more cost-effective to hang up a "Please excuse our maintenance" sign for a few hours and cut our losses on the revenue for that period than it would be to undertake all that extra work and pay out all those software costs and server costs. All the while, this could be resolved almost entirely if AWS would support multi-region syncing/multi-region services. And every time they have an outage like this, there's more and more of an outcry for it -- so hopefully they'll come to their senses soon!
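To put that trade-off in back-of-the-envelope terms, here's a minimal sketch of the break-even check. The only figure taken from this thread is the $20k Riak licensing cost; the outage hours and hourly revenue are placeholder numbers I made up, and the function names are just for illustration.

```python
def expected_outage_loss(outage_hours_per_year: float, revenue_per_hour: float) -> float:
    """Revenue lost per year if we simply ride out regional outages."""
    return outage_hours_per_year * revenue_per_hour


def multi_region_pays_off(
    outage_hours_per_year: float,
    revenue_per_hour: float,
    extra_annual_cost: float,
) -> bool:
    """True only when the outage losses exceed what multi-region would cost us."""
    return expected_outage_loss(outage_hours_per_year, revenue_per_hour) > extra_annual_cost


# Hypothetical inputs: 6 hours of regional downtime a year, $1,000/hour revenue,
# and $20,000 as the floor for extra cost (software only, before servers,
# cross-region data transfer, and admin time).
print(multi_region_pays_off(6, 1_000, 20_000))
```

With those made-up inputs the outage costs $6,000 a year against at least $20,000 of extra spend, so the maintenance sign wins. The answer flips once revenue per hour or downtime gets high enough, which is exactly the point where going multi-region by hand starts to make sense.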