Are there lessons in the Amazon outage? Yes. Probably. But it's complicated.
By way of background, in the very early morning of Thursday, April 21, Amazon Web Services LLC experienced a significant failure that took many sites offline. Amazon has not yet posted (or, most likely, determined) the root cause. However, initial analysis suggests that a major network failure caused subsequent problems with Elastic Block Store (EBS), an Amazon storage service that serves a role similar to that of a disk array within an enterprise datacenter.
Rightscale, which provides management services for users of Amazon and other public clouds, has posted a thorough analysis on its blog, based on what is known to date. But I'm more interested in the broader implications. In particular, what does this say about cloud computing and IT governance around cloud computing?
Hyperbolic public relations pitches began arriving in my inbox by late Thursday proclaiming that this or that expert would like to offer a perspective on how the incident proves that the "cloud isn't ready for prime time." As if there had never been a service failure within an on-premise datacenter. Drivel.
A more sophisticated take was that public cloud services are safe, but the onus is on the consumers of the services to do due diligence and otherwise architect their applications to ride through and mitigate service failures. I think this is an unobjectionable statement -- but it is a generalization.
Returning to the case of Amazon, one of the symptoms of this particular failure was that multiple availability zones failed. The idea behind availability zones is that they're independent and therefore unlikely to fail simultaneously. Thus, so long as you're running redundantly in two different zones, you're supposedly safe.
But not in this case. Rightscale notes that this "is an indication that the EBS control plane has dependencies across zones. Amazon did manage to contain the problem to one zone approx 3 hours after the onset."
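To put rough numbers on why a cross-zone dependency matters, here is a back-of-the-envelope sketch. The failure probabilities are invented for illustration and are not Amazon figures, and the function name `both_zones_down` is hypothetical. It assumes a simple model in which either a shared component fails and takes both zones with it, or the zones fail independently.

```python
def both_zones_down(p_zone: float, p_shared: float = 0.0) -> float:
    """Probability that two zones are unavailable at the same time.

    p_zone:   chance that a single zone fails on its own (hypothetical).
    p_shared: chance that a dependency shared by both zones fails and
              takes both down (hypothetical; 0.0 means truly independent).
    """
    # Either the shared dependency fails, or it holds and both zones
    # happen to fail independently of each other.
    return p_shared + (1.0 - p_shared) * p_zone * p_zone


# Truly independent zones, each with a 1% chance of failing.
independent = both_zones_down(0.01)

# Same zones, plus a 0.5% chance that a shared control plane fails.
coupled = both_zones_down(0.01, p_shared=0.005)

print(f"independent zones: {independent:.6f}")
print(f"with shared dependency: {coupled:.6f}")
```

Under these made-up numbers, even a small shared dependency dominates the result: the chance of losing both zones is driven almost entirely by the shared component rather than by the zones themselves.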
It's true that, for sufficiently critical applications, depending upon one availability mechanism isn't sufficient. But it's also true that, for most purposes, depending upon a provider's availability mechanisms to work as advertised is really not unreasonable. At least to this extent, I'm sympathetic to Klint Finley's perspective on ReadWriteWeb that the blame here isn't on the customers -- at least those who took advantage of high-availability mechanisms that weren't.
Service providers also have other basic obligations. Returning to Amazon again, Rightscale offers the harsh critique that "Amazon's communication, while better than during previous outages, still earns an F. This is probably the #1 threat to AWS's business."
In a separate vein, service providers should be expected to have stringent policies around physical access to their datacenters and, more generally, access to customer data. Google has posted a video offering a rare glimpse inside one of its datacenters and describing its security and data protection procedures.
As a not-incidental side note, the video also briefly shows the tape libraries that Google uses as a sort of backup-of-last-resort. Although tape can sometimes seem like a relic of a bygone era, it's worth noting that a Google Apps outage from earlier this year was ultimately recovered -- using tape.
All that said, the developers of applications ultimately have the obligation to not only do appropriate due diligence on their service providers but to architect appropriately. "Appropriately" is the key word here. Not every application need be protected from simultaneous meteor strikes (or, more likely, independent systems that aren't actually independent).
Sometimes, you don't have a choice. Some software-as-a-service applications deliver particular value to your business, and the vendor is well established. Maybe it makes sense to just put your trust there.
However, where possible, give yourself options and mitigate risk. Risk is unavoidable and needs to be measured against benefit, in any case. But at least understand what benefits you're gaining and what risks you're accepting.
-- Gordon Haff works in marketing at Red Hat Inc.