Gordon Haff

Takeaways From the Amazon Outage

Written by Gordon Haff
4/28/2011 29 comments

Are there lessons in the Amazon outage? Yes. Probably. But it’s complicated.

By way of background, in the very early morning of Thursday, April 21, Amazon Web Services LLC experienced a significant failure that took many sites offline. Amazon has not yet posted (or, most likely, determined) the root cause. However, initial analysis suggests that a major network failure caused subsequent problems with Elastic Block Store (EBS), an Amazon storage service that serves a role similar to that of a disk array within an enterprise datacenter.

Rightscale, which provides management services for users of Amazon and other public clouds, has posted a thorough analysis on its blog, based on what is known to date. But I’m more interested in the broader implications. In particular, what does this say about cloud computing and IT governance around cloud computing?

Hyperbolic public relations pitches began arriving in my inbox by late Thursday proclaiming that this or that expert would like to offer a perspective on how the incident proves that the “cloud isn’t ready for prime time.” As if there had never been a service failure within an on-premises datacenter. Drivel.

A more sophisticated take was that public cloud services are safe, but the onus is on the consumers of the services to do due diligence and otherwise architect their applications to ride through and mitigate service failures. I think this is an unobjectionable statement -- but it is a generalization.

Returning to the case of Amazon, one of the symptoms of this particular failure was that multiple availability zones failed. The idea behind availability zones is that they’re independent and therefore unlikely to fail simultaneously. Thus, so long as you’re running redundantly in two different zones, you’re supposedly safe.

But not in this case. Rightscale notes that this “is an indication that the EBS control plane has dependencies across zones. Amazon did manage to contain the problem to one zone approx 3 hours after the onset.”
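To see why a cross-zone dependency matters so much, here is a minimal back-of-the-envelope sketch. The failure probabilities are made-up illustrative assumptions, not Amazon figures, and the function name `both_zones_down` is hypothetical. It simply compares the chance of losing two zones at once when they are truly independent with the chance when both also rely on one shared component.

```python
def both_zones_down(p_zone: float, p_shared: float = 0.0) -> float:
    """Probability that two zones are down at the same time.

    p_zone:   chance that a single zone fails on its own (assumed, illustrative)
    p_shared: chance that a component common to both zones fails (assumed)
    """
    # With no shared dependency, both zones must fail independently.
    independent = p_zone * p_zone
    # A shared component takes both zones down by itself; if it stays up,
    # we are back to the independent case.
    return p_shared + (1.0 - p_shared) * independent


if __name__ == "__main__":
    p_zone = 0.01     # hypothetical: 1% chance a zone is down in some window
    p_shared = 0.001  # hypothetical: 0.1% chance the shared component is down
    print(f"independent zones: {both_zones_down(p_zone):.6f}")
    print(f"shared dependency: {both_zones_down(p_zone, p_shared):.6f}")
```

With these assumed numbers, even a shared component that is ten times more reliable than a single zone makes a double-zone outage roughly eleven times more likely than the independent case would suggest.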

It’s true that, for sufficiently critical applications, depending upon one availability mechanism isn’t sufficient. But it’s also true that, for most purposes, depending upon a provider’s availability mechanisms to work as advertised is really not unreasonable. At least to this extent, I’m sympathetic to Klint Finley’s perspective on ReadWriteWeb that the blame here isn’t on the customers -- at least those who took advantage of high-availability mechanisms that weren’t.

Service providers also have other basic obligations. Returning to Amazon again, Rightscale's critique is harsh: “Amazon’s communication, while better than during previous outages, still earns an F. This is probably the #1 threat to AWS’s business.”

In a separate vein, service providers should be expected to have stringent policies around physical access to their datacenters and, more generally, access to customer data. Google has posted a video offering a rare glimpse inside one of its datacenters and describing its security and data protection procedures.

As a not-incidental side note, the video also briefly shows the tape libraries that Google uses as a sort of backup-of-last-resort. Although tape can sometimes seem like a relic of a bygone era, it’s worth noting that a Google Apps outage from earlier this year was ultimately recovered -- using tape.

All that said, the developers of applications ultimately have the obligation to not only do appropriate due diligence on their service providers but to architect appropriately. "Appropriately" is the key word here. Not every application need be protected from simultaneous meteor strikes (or, more likely, independent systems that aren’t actually independent).

Sometimes, you don’t have a choice. Some software-as-a-service applications deliver particular value to your business, and the vendor is well established. Maybe it makes sense to just put your trust there.

However, where possible, give yourself options and mitigate risk. Risk is unavoidable and needs to be measured against benefit, in any case. But at least understand what benefits you’re gaining and what risks you’re accepting.

— Gordon Haff works in marketing at Red Hat Inc.

alistair
Rank: Cave Painter
Tuesday May 3, 2011 8:37:10 AM

OK, it's an own goal and an embarrassment and will provide fodder for those who are against cloud on principle, but for the pragmatist it's just an illustration of why risk management can't be outsourced even if your data centre is...

slfisher
Thinkernetter
Sunday May 1, 2011 8:58:35 AM

You say, "But it’s complicated."

No, it's not. S*** happens, in *any* aspect of life. It's important for all of us, in everything, to assume that things could fail and figure out what alternatives we'll use. 

Ask the people at Fukushima. The safety of their reactor was predicated on having electricity to cool it, that an earthquake would only be an 8, that there wouldn't be a tsunami, and that if there were, the seawall would protect it -- all with tragic results.

It's true for all of us. What would we do if the water system failed? How many people store water? How would we cook if the electricity failed? How would we get more food if gas became nonexistent? Really, we can't count on anything, and the sooner everybody along the supply chain realizes it and accounts for it, the more secure we'll all be.

SecTech
Thinkernetter
Saturday April 30, 2011 10:09:58 PM

No matter if the data is on site or off site, outages happen.  What concerned me more was the data loss.  Were backups not performed?  If they were, couldn't the data be restored from the backups?  If not, why not?  What are the penalties in such circumstances?

nathanwosnack
IQ Crew
Saturday April 30, 2011 10:08:17 PM

This goes to show the reality of a situation that was inevitable. People shouldn't rely completely upon any cloud-based service, or any centrally controlled service for that matter, for critical data storage. Storage of data or running services should happen on multiple platforms. Relying solely upon "the cloud" (aka utility computing) is only asking for trouble. Is this cloud hype over with yet?

chuckgregory
IQ Crew
Friday April 29, 2011 6:04:11 PM

I don't think I'd rely on a single vendor for all my data, no matter how much I had and how good their guarantees were. It might be too expensive to have a 'hot backup' somewhere else, and I might have to trust them on that part, but I think in a mission-critical application it would be highly negligent not to have another copy, somewhere else completely (and I mean not just geographically but at a different vendor or in-house).

taimur_tz
Thinkernetter
Friday April 29, 2011 4:42:10 PM

@chuck, from a personal standpoint it's not much of a big deal to keep a local backup of your data. Gmail itself is a free service, so you don't mind experiencing an outage once in a while or losing some of your data. However, when it comes to business and when there's money at stake, the game changes. Some businesses really can't afford to have outages or lose any of their data. The cloud vendors have to guarantee the uptime of their service and stay committed on the reliability part. I agree that the incident will serve as an important lesson for both cloud vendors and organizations looking to move to the cloud.

chuckgregory
IQ Crew
Friday April 29, 2011 8:29:21 AM

I agree with abdlah's assessment of the article. Gordon, you summed things up very nicely and I compliment you on that. I'd like to add a couple of small comments:

First--I was one of those who lost access to Gmail during the Google Apps outage earlier this year. I was among those whose data was restored from tape and I am very glad that Google does use that 'old technology' as a final recourse. I have since adopted the practice of always keeping my local copy of my Gmail inbox current. The incident brought home to me how much I was relying on a single provider, and I took steps to resolve that issue.

To me, the key to protecting yourself is to have redundant copies, not only the redundant copies maintained for you by a provider such as Amazon, Google, or your ISP, but at least one additional copy somewhere else. Preferably you have a copy at home on your own computer (or for a company, at your datacenter). I'm inclined to have things stored in a couple of different places on the web, too, but I've been accused of being paranoid upon occasion...

Recently I've been learning a lot about hacked websites and their recovery, and the methods to keep a 'hack' from happening or at least from spreading. A backup is little help if it has the same infection as your actual site. I have a feeling this is an area that will require increased vigilance over the next few years.

Thanks again, Gordon.

 

abdlah
IQ Crew
Friday April 29, 2011 7:58:07 AM

Gordon, your article is a welcome contribution to the discussion on the reliability of cloud computing. It does seem that some writers are giving a negative slant to the incident.

Your article is a more well-rounded discussion of the issues, identifying critical pros and cons as well as best practices for cloud adoption. Thanks.

ivka
IQ Crew
Friday April 29, 2011 7:57:45 AM

I don't think that high cost is the only reason preventing Amazon and Google from switching away from tape backup technology. There is much more to this, such as the ability to remove data, transport it to another location and so on. This article discusses the question in detail.

nimantha.de
IQ Crew
Friday April 29, 2011 6:04:09 AM

Yes, that's the bad thing about Amazon. They should give up the old style of backing up and move with the trend. I know it might be a big process and might cost a large amount along the way, but it's worth it now rather than facing a much bigger crisis later.

The ThinkerNet does not reflect the views of TechWeb. The ThinkerNet is an informal means of communication to members and visitors of the Internet Evolution site. Individual authors are chosen by Internet Evolution to blog. Neither Internet Evolution nor TechWeb assume responsibility for comments, claims, or opinions made by authors and ThinkerNet bloggers. They are no substitute for your own research and should not be relied upon for trading or any other purpose.