Are there lessons in the Amazon outage? Yes. Probably. But it’s complicated.
By way of background, sometime around the very early morning of Thursday, April 21, Amazon Web Services LLC experienced a significant failure that took many sites offline. Amazon has not yet posted (or, most likely, determined) the root cause. However, initial analysis suggests that a major network failure caused subsequent problems with Elastic Block Store (EBS), an Amazon storage service that serves a role similar to that of a disk array within an enterprise datacenter.
Rightscale, which provides management services for users of Amazon and other public clouds, has posted a thorough analysis on its blog, based on what is known to date. But I’m more interested in the broader implications. In particular, what does this say about cloud computing and IT governance around cloud computing?
Hyperbolic public relations pitches began arriving in my inbox by late Thursday proclaiming that this or that expert would like to offer a perspective on how the incident proves that the “cloud isn’t ready for prime time.” As if there had never been a service failure within an on-premise datacenter. Drivel.
A more sophisticated take was that public cloud services are safe, but the onus is on the consumers of the services to do due diligence and otherwise architect their applications to ride through and mitigate service failures. I think this is an unobjectionable statement -- but it is a generalization.
Returning to the case of Amazon, one of the symptoms of this particular failure was that multiple availability zones failed. The idea behind availability zones is that they’re independent and therefore unlikely to fail simultaneously. Thus, so long as you’re running redundantly in two different zones, you’re supposedly safe.
But not in this case. Rightscale notes that this “is an indication that the EBS control plane has dependencies across zones. Amazon did manage to contain the problem to one zone approx 3 hours after the onset.”
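To see why a hidden cross-zone dependency matters so much, here is a back-of-the-envelope sketch in Python. The function name and every probability in it are hypothetical, chosen only for illustration; they are not Amazon's actual failure rates. It compares the chance of losing two zones at once when they are truly independent with the chance when a shared component can take both down.

```python
def both_zones_down(p_zone: float, p_shared: float = 0.0) -> float:
    """Probability that two zones are down at the same time.

    p_zone:   chance that a single zone fails on its own (hypothetical figure)
    p_shared: chance that a shared dependency, such as a cross-zone
              control plane, fails and takes both zones down with it
    """
    independent = p_zone * p_zone
    # Either the shared component fails, or it survives and
    # both zones happen to fail independently.
    return p_shared + (1.0 - p_shared) * independent


if __name__ == "__main__":
    p = 0.001  # hypothetical per-zone failure probability
    print("independent zones:", both_zones_down(p))
    print("with shared dependency:", both_zones_down(p, p_shared=0.0005))
```

Under these made-up numbers, even a shared dependency that fails half as often as a single zone dominates the result: the redundancy only pays off to the extent that the zones really are independent.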
It’s true that, for sufficiently critical applications, depending upon one availability mechanism isn’t sufficient. But it’s also true that, for most purposes, depending upon a provider’s availability mechanisms to work as advertised is really not unreasonable. At least to this extent, I’m sympathetic to Klint Finley’s perspective on ReadWriteWeb that the blame here isn’t on the customers -- at least those who took advantage of high-availability mechanisms that weren’t.
Service providers also have other basic obligations. Returning to Amazon again, Rightscale is harshly critical: “Amazon’s communication, while better than during previous outages, still earns an F. This is probably the #1 threat to AWS’s business.”
In a separate vein, service providers should be expected to have stringent policies around physical access to their datacenters and, more generally, access to customer data. Google has posted a video offering a rare glimpse inside one of its datacenters and describing its security and data protection procedures.
As a not-incidental side note, the video also briefly shows the tape libraries that Google uses as a sort of backup-of-last-resort. Although tape can sometimes seem like a relic of a bygone era, it’s worth noting that a Google Apps outage from earlier this year was ultimately recovered -- using tape.
All that said, the developers of applications ultimately have the obligation to not only do appropriate due diligence on their service providers but to architect appropriately. "Appropriately" is the key word here. Not every application need be protected from simultaneous meteor strikes (or, more likely, independent systems that aren’t actually independent).
Sometimes, you don’t have a choice. Some software-as-a-service applications deliver particular value to your business, and the vendor is well established. Maybe it makes sense to just put your trust there.
However, where possible, give yourself options and mitigate risk. Risk is unavoidable and needs to be measured against benefit, in any case. But at least understand what benefits you’re gaining and what risks you’re accepting.
OK, it's an own goal and an embarrassment, and it will provide fodder for those who are against cloud on principle, but for the pragmatist it's just an illustration of why risk management can't be outsourced even if your data centre is...
No, it's not. S*** happens, in *any* aspect of life. It's important for all of us, in everything, to assume that things could fail and figure out what alternatives we'll use.
Ask the people at Fukushima. The safety of their reactor was predicated on having electricity to cool it, on an earthquake being no stronger than an 8, on there being no tsunami, and on the seawall protecting the plant if there was one -- all with tragic results.
It's true for all of us. What would we do if the water system failed? How many people store water? How would we cook if the electricity failed? How would we get more food if gas became nonexistent? Really, we can't count on anything, and the sooner everybody along the supply chain realizes it and accounts for it, the more secure we'll all be.
No matter if the data is on site or off site, outages happen. What concerned me more was the data loss. Were backups not performed? If they were, couldn't the data be restored from the backups? If not, why not? What are the penalties in such circumstances?
This goes to show the reality of the situation that was inevitable. People shouldn't rely completely upon any cloud-based service, or any centrally controlled service for that matter, for critical data storage. Storage of data or running services should happen on multiple platforms. Relying solely upon "the cloud" (aka utility computing) is only asking for trouble. Is this cloud hype over with yet?
I don't think I'd rely on a single vendor for all my data, no matter how much I had and how good their guarantees were. It might be too expensive to have a 'hot backup' somewhere else, and you might have to trust them on that part, but I think in a mission-critical application it would be highly negligent not to have another copy somewhere else completely (and I mean not just geographically but at a different vendor or in-house).
@chuck, from a personal standpoint it's not such a big deal to keep a local backup of your data. Gmail itself is a free service, so you don't mind experiencing an outage once in a while or losing some of your data. However, when it comes to business and when there's money at stake, the game changes. Some businesses really can't afford to have outages or lose any of their data. The cloud vendors have to guarantee the uptime of their service and stay committed to reliability. I agree that the incident will serve as an important lesson for both cloud vendors and organizations looking to move to the cloud.
I agree with abdlah's assessment of the article. Gordon, you summed things up very nicely and I compliment you on that. I'd like to add a couple of small comments:
First--I was one of those who lost access to Gmail during the Google Apps outage earlier this year. I was among those whose data was restored from tape and I am very glad that Google does use that 'old technology' as a final recourse. I have since adopted the practice of always keeping my local copy of my Gmail inbox current. The incident brought home to me how much I was relying on a single provider, and I took steps to resolve that issue.
To me, the key to protecting yourself is to have redundant copies: not only the redundant copies maintained for you by a provider such as Amazon, Google, or your ISP, but at least one additional copy somewhere else. Preferably you have a copy at home on your own computer (or for a company, at your datacenter). I'm inclined to have things stored in a couple of different places on the web, too, but I've been accused of being paranoid upon occasion...
Recently I've been learning a lot about hacked websites and their recovery, and the methods to keep a 'hack' from happening or at least from spreading. A backup is little help if it has the same infection as your actual site. I have a feeling this is an area that will require increased vigilance over the next few years.
Gordon, your article is a welcome contribution to the discussion on the reliability of Cloud Computing. It does seem that some writers are giving a negative slant to the incident.
Your article is a more well-rounded discussion of the issues, identifying critical pros and cons as well as best practices for Cloud adoption. Thanks.
I don't think that high cost is the only reason preventing Amazon and Google from switching away from tape backup technology. There is much more to this, such as the ability to remove data, transport it to another location and so on. This article discusses the question in detail.
Yes, that's the bad thing about Amazon. They should give up the old style of backing up and move with the trend. I know it might be a big process and might cost a lot along the way, but it's worth doing now rather than facing a much bigger crisis later.
The ThinkerNet does not reflect the views of TechWeb. The ThinkerNet is an informal means of communication to members and visitors of the Internet Evolution site. Individual authors are chosen by Internet Evolution to blog. Neither Internet Evolution nor TechWeb assume responsibility for comments, claims, or opinions made by authors and ThinkerNet bloggers. They are no substitute for your own research and should not be relied upon for trading or any other purpose.