Amazon explains exactly why its cloud server outage took down much of the internet
Amazon has laid out why its cloud computing service went dark this week, taking down online banking platforms, a government website, social media, and food delivery services for hours.
In a lengthy description posted on its site, Amazon Web Services (AWS) said there was a bug in its automation software that caused cascading issues.
The root of the problem was that its customers – including companies such as Signal – could not connect to DynamoDB, a system that stores customer data.
AWS said this was because of “a latent defect within the service’s automated DNS [domain name system] management system”.
DNS can be thought of as the internet’s phonebook. It translates website names, such as www.amazon.com, into machine-friendly IP addresses that computers use to find each other on a network.
But AWS’s DNS automation system deleted the DNS records for its regional endpoint, so connections to DynamoDB and other services failed.
The system failed to repair itself automatically, and operators had to intervene manually.
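To illustrate the failure described above, here is a minimal sketch in Python of what happens when a name's DNS records disappear. It is a toy model, not AWS's actual system: the lookup table, the hostnames, the IP addresses, and the `resolve` and `connect` functions are all hypothetical and exist only for this example.

```python
# Toy model of DNS: a table mapping hostnames to IP addresses.
# Every name and address here is made up for illustration.
records = {
    "dynamodb.region.example": ["192.0.2.10", "192.0.2.11"],
    "www.shop.example": ["192.0.2.80"],
}


class NameNotFound(Exception):
    """Raised when no record exists for a hostname."""


def resolve(hostname, table):
    """Return the list of IP addresses recorded for hostname."""
    addresses = table.get(hostname)
    if not addresses:
        raise NameNotFound(hostname)
    return addresses


def connect(hostname, table):
    """Pretend to connect: succeeds only if the name resolves."""
    try:
        ip = resolve(hostname, table)[0]
    except NameNotFound:
        return f"cannot connect: no DNS record for {hostname}"
    return f"connected to {hostname} at {ip}"


# While the records exist, the client can find the service.
before = connect("dynamodb.region.example", records)

# Simulate an automation bug that deletes the endpoint's records.
del records["dynamodb.region.example"]

# The service may still be running, but clients can no longer find it.
after = connect("dynamodb.region.example", records)

print(before)
print(after)
```

In this sketch the second attempt fails even though nothing about the service itself has changed. Only the phonebook entry is missing, and in the toy model, as in the incident AWS described, it stays missing until someone puts it back.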
These issues hit many core AWS services in its Northern Virginia region, where Amazon has a headquarters. Though the issues were resolved in a matter of hours, the total impact on websites and apps lasted 14.5 hours. More than 8 million people reported issues.
AWS customers, including Signal, Roblox, Snapchat, and the UK’s tax and revenue website, were among the 2,000 sites affected, according to Downdetector, a site that monitors internet outages.
The incident prompted technology experts to highlight Europe’s overreliance on a single cloud service and its own cloud ambitions.