Are there lessons in the Amazon outage? Yes. Probably. But it’s complicated.
By way of background, in the very early morning of Thursday, April 21, Amazon Web Services LLC experienced a significant failure that took many sites offline. Amazon has not yet posted (or, most likely, determined) the root cause. However, initial analysis suggests that a major network failure caused subsequent problems with Elastic Block Store (EBS), an Amazon storage service that serves a role similar to that of a disk array within an enterprise datacenter.
RightScale, which provides management services for users of Amazon and other public clouds, has posted a thorough analysis on its blog, based on what is known to date. But I’m more interested in the broader implications. In particular, what does this say about cloud computing and IT governance around cloud computing?
Hyperbolic public relations pitches began arriving in my inbox by late Thursday proclaiming that this or that expert would like to offer a perspective on how the incident proves that the “cloud isn’t ready for prime time.” As if there had never been a service failure within an on-premises datacenter. Drivel.
A more sophisticated take was that public cloud services are safe, but the onus is on the consumers of the services to do due diligence and otherwise architect their applications to ride through and mitigate service failures. I think this is an unobjectionable statement -- but it is a generalization.
Returning to the case of Amazon, one of the symptoms of this particular failure was that multiple availability zones failed. The idea behind availability zones is that they’re independent and therefore unlikely to fail simultaneously. Thus, so long as you’re running redundantly in two different zones, you’re supposedly safe.
But not in this case. RightScale notes that this “is an indication that the EBS control plane has dependencies across zones. Amazon did manage to contain the problem to one zone approx 3 hours after the onset.”
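To make the independence point concrete, here is a back-of-the-envelope sketch. It assumes a deliberately simple model that is mine, not Amazon's or RightScale's: each zone fails on its own with some probability, and a shared dependency (such as a cross-zone control plane) takes down both zones when it fails. The function name and all of the probabilities are hypothetical, chosen only for illustration.

```python
def both_zones_down(p_zone: float, p_shared: float = 0.0) -> float:
    """Probability that two zones are unavailable at the same time.

    p_zone   -- hypothetical chance that a single zone fails on its own
    p_shared -- hypothetical chance that a dependency common to both
                zones fails and takes both of them down
    """
    # Either the shared dependency fails (both zones go down), or it
    # survives and the two zones happen to fail independently.
    return p_shared + (1.0 - p_shared) * p_zone ** 2


# Illustrative numbers only; these are not measured AWS figures.
independent = both_zones_down(0.01)
coupled = both_zones_down(0.01, p_shared=0.001)

print(f"truly independent zones: {independent:.6f}")
print(f"with a shared dependency: {coupled:.6f}")
```

With these made-up numbers, even a shared dependency that fails only one time in a thousand makes a simultaneous outage about eleven times more likely than the independent case would suggest. Redundancy across zones buys far less than it appears to once the zones quietly share a component.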
It’s true that, for sufficiently critical applications, depending upon one availability mechanism isn’t sufficient. But it’s also true that, for most purposes, depending upon a provider’s availability mechanisms to work as advertised is really not unreasonable. At least to this extent, I’m sympathetic to Klint Finley’s perspective on ReadWriteWeb that the blame here isn’t on the customers -- at least those who took advantage of high-availability mechanisms that weren’t.
Service providers also have other basic obligations. Returning to Amazon again, RightScale is harsh in its critique: “Amazon’s communication, while better than during previous outages, still earns an F. This is probably the #1 threat to AWS’s business.”
In a separate vein, service providers should be expected to have stringent policies around physical access to their datacenters and, more generally, access to customer data. Google has posted a video offering a rare glimpse inside one of its datacenters and describing its security and data protection procedures.
As a not-incidental side note, the video also briefly shows the tape libraries that Google uses as a sort of backup-of-last-resort. Although tape can sometimes seem like a relic of a bygone era, it’s worth noting that a Google Apps outage from earlier this year was ultimately recovered -- using tape.
All that said, the developers of applications ultimately have the obligation to not only do appropriate due diligence on their service providers but to architect appropriately. "Appropriately" is the key word here. Not every application need be protected from simultaneous meteor strikes (or, more likely, independent systems that aren’t actually independent).
Sometimes, you don’t have a choice. Some software-as-a-service applications deliver particular value to your business, and the vendor is well established. Maybe it makes sense to just put your trust there.
However, where possible, give yourself options and mitigate risk. Risk is unavoidable and needs to be measured against benefit, in any case. But at least understand what benefits you’re gaining and what risks you’re accepting.
-- Gordon Haff works in marketing at Red Hat Inc.