Automated scaling activity caused AWS East to fail on Tuesday, Amazon says
The capacity scaling processes triggered “unexpected behavior,” disrupting AWS’s internal network and services hosted in the popular region.
Amazon provided additional details of last week’s AWS outage, which put many cloud-based businesses and services out of touch for hours on Tuesday. Some hosted or resource-dependent services in certain data centers in the AWS East-1 region are unresponsive Tuesday due to an internal network outage.
During the outage, AWS us-east-1 customers were unable to access services, including EC2, Connect, DynamoDB, Glue, Athena, Timestream, and Chime. Popular streaming services affected by the outage included Disney Plus and Netflix. The AWS outage has blocked users of the dating app Tinder, the cryptocurrency service Coinbase, and the cash app Venmo. Players couldn’t launch popular video games like PUBG and League of Legends. Some Amazon couriers and parcel delivery drivers were also unable to do their jobs.
AWS Regions are physical locations around the world where the company operates data centers and connects its wide area network (WAN). Due to its location and the variety of its services, many AWS customers depend on Region East-1.
Region East-1 is located in Northern Virginia in “Data Center Alley”. Seventy percent of global Internet traffic passes through the region, according to estimates. Therefore, aws-east-1 is extremely popular with AWS customers. It is also the most diverse region in AWS, with more business areas and local areas than anywhere else. Business areas are logical collections of data centers within a region. Local zones contain advanced computing resources.
These problematic scaling processes triggered an increase in connection activity that overwhelmed network devices, according to AWS.
“At 7:30 am PST, an automated activity to scale the capacity of one of the AWS services hosted in the core AWS network triggered unexpected behavior from a large number of customers within the internal network.” , the company noted.
“Previously unobserved behavior”
“These delays have increased the latency and errors for the services communicating between these networks, causing even more retries and retries to connect. This has led to persistent congestion and performance issues on the devices connecting the two networks, ”said AWS.
To complicate mitigation efforts, AWS operators were flying blind, according to AWS notes.
“Instead, operators relied on the logs to figure out what was going on and initially identified high internal DNS errors,” AWS said. It would take further measurements and several hours before things returned to normal. Why?
“First, the impact on internal oversight limited our ability to understand the problem. Second, our internal deployment systems, which run on our internal network, were affected, further hampering our remediation efforts, ”said AWS.
AWS said it would not resume scaling activity from AWS East, which caused the outage before testing the fixes.
“Our network customers have well tested the request cancellation behaviors that are designed to allow our systems to recover from these types of congestion events, but a latent issue prevented these customers from properly opting out during this event.” , said AWS.
The automated scaling activity triggered “previously unobserved behavior,” for which AWS engineers are currently developing a fix. The company expects it to be deployed within the next two weeks.
AWS said it was reworking its service health dashboard to provide more accurate and timely information. AWS has also made additional changes to the network configuration to protect affected devices even if a similar event recurs.
“These corrective actions give us assurance that we will not see a recurrence of this problem,” the company said.