I really don’t want to be in the shoes of the guys over at the AWS NOC: this is not a good day for them. Some of the Amazon Web Services hosted in the US-EAST region are affected (as of the time of writing) by an issue related to EBS re-mirroring. We experienced a similar issue on Saturday, when connections to an RDS instance kept timing out, and we concluded at the time that it was caused by a network issue inside AWS. That only lasted for about 15 minutes, but it could have been a warning sign of a bigger problem to come. I’m sorry to say that it was. Here’s what Amazon declared at 8:54 AM PDT:
We’d like to provide additional color on what we’re working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it’s difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We’re starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.
(http://status.aws.amazon.com/)
This issue has affected the RDS service and EBS volumes. Among the victims of this outage are Foursquare, Reddit and Quora. Even some of the RDS instances I manage have been affected, though luckily none of them are critical production deployments. As of 11:09 AM PDT, Amazon still cannot give an ETA for when all services will be fully recovered; hopefully no critical data will be lost.
This outage is a grim wake-up call about the risks associated with cloud computing: even Amazon can have a serious problem. Don’t get me wrong, I think cloud computing is great (scalability, high availability, monitoring, a business model where you pay for what you actually use), and Amazon has been doing a great job, but it is now clear that even a cloud, or at least part of it, can become a single point of failure. In the two years since I started using AWS I’ve experienced weird and annoying reliability problems, such as ELBs going MIA and EC2 instances going dead, but this is by far the worst.
We’ll probably see some good things come out of this story though, such as a complete overhaul of the strategies behind deploying applications in the cloud: distributing services across different availability zones and, why not, across different clouds, and backing everything up with private clouds.
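To make the availability-zone idea a bit more concrete, here is a minimal sketch in plain Python. It makes no AWS calls at all: the zone names, the instance names and the two helper functions (spread_across_zones and survivors) are my own illustrative assumptions, not anything Amazon provides. It only shows that placing instances round-robin across zones keeps part of the fleet running when a single zone is lost.

```python
from collections import Counter
from itertools import cycle


def spread_across_zones(instance_names, zones):
    """Assign each instance to a zone in round-robin order."""
    if not zones:
        raise ValueError("at least one availability zone is required")
    zone_cycle = cycle(zones)
    return {name: next(zone_cycle) for name in instance_names}


def survivors(placement, failed_zone):
    """Return the instances still running when one zone is lost."""
    return [name for name, zone in placement.items() if zone != failed_zone]


# Hypothetical zone and instance names, for illustration only.
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
instances = [f"web-{i}" for i in range(6)]

placement = spread_across_zones(instances, zones)
print(Counter(placement.values()))         # instances per zone
print(survivors(placement, "us-east-1a"))  # what is left if one zone fails
```

With six instances over three zones, losing one zone takes out two instances instead of all six. A real deployment would also have to replicate data and redirect traffic, which this sketch deliberately leaves out.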
If you have any comments or experiences you would like to share, just post a comment below. I am very curious to see what strategies you used to mitigate this issue.