Last week we reported on the outage of Amazon’s S3 storage service in the U.S. and Europe. Both regions experienced “elevated error rates”. The service, which also went down in February, was down for almost 8 hours this time. Amazon has now made an announcement about the outage.
Amazon’s announcement reads:
We’ve now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect. We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers’ objects. However, we didn’t have the same protection in place to detect whether this particular internal state information had been corrupted. As a result, when the corruption occurred, we didn’t detect it and it spread throughout the system causing the symptoms described above. We hadn’t encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.
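To illustrate the kind of protection the announcement describes, here is a minimal sketch of a checksummed state message. It is not Amazon's implementation: the message layout (an MD5 hex digest, a `|` separator, then a JSON payload) and the names `seal`, `unseal`, and `flip_bit` are assumptions made up for this example. It shows how a single flipped bit can leave a message readable yet wrong, and how a checksum lets the receiver reject it.

```python
import hashlib
import json

DIGEST_LEN = 32  # length of an MD5 hex digest


def seal(state: dict) -> bytes:
    """Serialize a (hypothetical) state message and prepend its MD5 digest."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    digest = hashlib.md5(payload).hexdigest().encode("ascii")
    return digest + b"|" + payload


def unseal(message: bytes) -> dict:
    """Verify the digest and return the state; raise ValueError on mismatch."""
    digest = message[:DIGEST_LEN]
    payload = message[DIGEST_LEN + 1:]
    if hashlib.md5(payload).hexdigest().encode("ascii") != digest:
        raise ValueError("state message failed checksum; rejecting")
    return json.loads(payload)


def flip_bit(message: bytes, byte_index: int, bit: int) -> bytes:
    """Simulate single-bit corruption in transit."""
    corrupted = bytearray(message)
    corrupted[byte_index] ^= 1 << bit
    return bytes(corrupted)


if __name__ == "__main__":
    message = seal({"server": "node-17", "status": "healthy"})
    print(unseal(message))  # the intact message is accepted

    # Flip one bit inside the payload: the JSON still parses, but the state changed.
    corrupted = flip_bit(message, len(message) - 5, 0)
    try:
        unseal(corrupted)
    except ValueError as err:
        print("rejected:", err)
```

Without the digest check, the corrupted message would parse cleanly and its incorrect state would be passed along, which is the failure mode the announcement describes. MD5 is used here only because the announcement mentions it; it guards against accidental corruption, not deliberate tampering.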
In their post-mortem analysis of the situation, they say:
During our post-mortem analysis we’ve spent quite a bit of time evaluating what happened, how quickly we were able to respond and recover, and what we could do to prevent other unusual circumstances like this from having system-wide impacts. Here are the actions that we’re taking:
we’ve deployed several changes to Amazon S3 that significantly reduce the amount of time required to completely restore system-wide state and restart customer request processing;
we’ve deployed a change to how Amazon S3 gossips about failed servers that reduces the amount of gossip and helps prevent the behavior we experienced on Sunday;
we’ve added additional monitoring and alarming of gossip rates and failures;
we’re adding checksums to proactively detect corruption of system state messages so we can log any such messages and then reject them.
Finally, we want you to know that we are passionate about providing the best storage service at the best price so that you can spend more time thinking about your business rather than having to focus on building scalable, reliable infrastructure. Though we’re proud of our operational performance in operating Amazon S3 for almost 2.5 years, we know that any downtime is unacceptable and we won’t be satisfied until performance is statistically indistinguishable from perfect.
Amazon’s effort to keep its users up to date was appreciated by many, including Center Networks, whose images were broken as a result of the outage. Amazon has modified its service, which will hopefully prevent such outages in the future.



lol, just lame excuses!
And this is not the first time it’s happened; the service went down in February as well. They really need to put their heads together and come up with a better strategy so that this won’t happen again, at least for some time.
Meh, it does sound like a very nicely phrased excuse, but find me a web service (one that’s growing in large numbers) that doesn’t have some downtime. It comes hand in hand with the process of scaling up. I don’t see this as having an impact on the global scheme.