The outage at Amazon Web Services last week, which took down a number of online services and disrupted Apple’s iCloud platform, was caused by a member of Amazon’s Simple Storage Service (S3) team entering a command incorrectly.
In a blog post, Amazon said the S3 team was debugging an issue that was slowing down the S3 billing system when a team member inadvertently ran a command with the wrong input, removing more servers than planned.
“Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests,” Amazon wrote in the blog post.
Amazon said that to prevent a repeat, it has modified the tool used to remove server capacity so that it removes capacity more slowly. It also said it has added safeguards that block a removal if it would take any subsystem below its minimum required capacity.
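To make the idea of those safeguards concrete, here is a minimal sketch in Python. It is not Amazon's tool, whose internals have not been published. The `Subsystem` class, the `plan_removal` function, the `min_required` floor and the `max_per_step` batch size are all hypothetical names chosen for illustration. The sketch assumes two checks: refuse any removal that would leave a subsystem under its capacity floor, and split an approved removal into small batches so capacity goes away slowly.

```python
from dataclasses import dataclass
from typing import List


class CapacityError(Exception):
    """Raised when a removal request would violate a safeguard."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # hypothetical floor needed to keep serving requests


def plan_removal(subsystem: Subsystem, requested: int, max_per_step: int = 2) -> List[int]:
    """Return batch sizes for a staged removal, or raise if the request is unsafe."""
    if requested <= 0 or max_per_step <= 0:
        raise CapacityError("requested removal and batch size must be positive")

    # Safeguard 1: never let the subsystem drop below its minimum capacity.
    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise CapacityError(
            f"removing {requested} servers would leave {subsystem.name} with "
            f"{remaining}, below its minimum of {subsystem.min_required}"
        )

    # Safeguard 2: remove capacity slowly, in small batches.
    batches = []
    left = requested
    while left > 0:
        step = min(max_per_step, left)
        batches.append(step)
        left -= step
    return batches
```

With an index subsystem of 100 servers and an assumed floor of 80, `plan_removal(index, 5)` returns `[2, 2, 1]`, while a mistyped request for 30 servers raises `CapacityError` before anything is removed.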
“We want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further,” Amazon wrote.