Slack fingers AWS auto-scaling failure in January outage postmortem

Slack says it has identified a scaling failure in its AWS Transit Gateways (TGWs) as the reason for the chat service’s monumental outage on 4 January. As a result, Amazon’s cloud computing arm said it is “reviewing the TGW scaling algorithms”.

The comms platform’s outage came at a bad time, just when people were getting back to work after the holiday period. This, said Slack in its detailed report, was a large part of the problem. Users re-opening Slack for the first time in a while had out-of-date caches and therefore requested more data than usual. “We go from our quietest time of the whole year to one of our biggest days quite literally overnight,” said the Slack team.

The problem with the TGW scaling was not immediately apparent. As so often happens, much of the challenge lay in working out what the core problem actually was. The first symptom was that the “dashboarding and alerting service became unavailable.” That in itself hampered troubleshooting, since that service holds the “dashboards with their pre-built queries” responders would normally reach for.

The situation then deteriorated. “Widespread packet loss” in Slack’s network led to “saturation of our web tier”, and Slack became completely unavailable. Automated systems intended to keep the service healthy instead made the problem worse.

Network problems left Slack’s web servers waiting on results from the backend, so CPU utilisation on those servers dropped. That “triggered some automated downscaling” and web servers were shut down. “Many of the incident responders on our call had their SSH sessions ended abruptly as the instances they were working on were de-provisioned,” said Slack. The fix was to disable the downscaling.
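For illustration only, here is a minimal boto3 sketch of the kind of intervention Slack describes: suspending scale-in on an EC2 Auto Scaling group so the platform stops de-provisioning instances. The group name and region are placeholders; Slack has not published its actual tooling.

# Minimal sketch: pausing scale-in on an EC2 Auto Scaling group with boto3.
# Group name and region are hypothetical, not Slack's real configuration.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Suspending the Terminate (and ReplaceUnhealthy) processes stops the group
# from de-provisioning instances while leaving scale-out (Launch) untouched.
autoscaling.suspend_processes(
    AutoScalingGroupName="webapp-tier",   # hypothetical group name
    ScalingProcesses=["Terminate", "ReplaceUnhealthy"],
)

# Once the network is healthy again, resume the suspended processes:
# autoscaling.resume_processes(
#     AutoScalingGroupName="webapp-tier",
#     ScalingProcesses=["Terminate", "ReplaceUnhealthy"],
# )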

At the same time, another algorithm spotted the growing number of threads on the web tier, all sitting waiting for a response, and triggered automatic upscaling.
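Again purely as an illustration, a thread-count-driven scale-up of this sort could be wired together as a step-scaling policy plus a CloudWatch alarm on a custom metric. Every name, namespace and threshold below is an assumption, not Slack’s configuration.

# Illustrative sketch: scale out when a custom "busy threads" metric climbs.
# All identifiers and numbers here are made up for the example.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Step-scaling policy: add capacity when the alarm below fires.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="webapp-tier",
    PolicyName="scale-up-on-busy-threads",
    PolicyType="StepScaling",
    AdjustmentType="PercentChangeInCapacity",
    MetricAggregationType="Average",
    StepAdjustments=[
        # Up to 20 threads over the threshold: +10% capacity; beyond: +25%.
        {"MetricIntervalLowerBound": 0.0, "MetricIntervalUpperBound": 20.0, "ScalingAdjustment": 10},
        {"MetricIntervalLowerBound": 20.0, "ScalingAdjustment": 25},
    ],
)

# Alarm on the custom metric; breaching it invokes the policy above.
cloudwatch.put_metric_alarm(
    AlarmName="web-tier-busy-threads-high",
    Namespace="Custom/WebTier",        # hypothetical custom namespace
    MetricName="BusyThreads",
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=400.0,                   # made-up threshold
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)

The point of the example is the coupling: a metric that rises precisely because the backend is unhealthy ends up demanding more of the very resources the unhealthy network cannot provision.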

“We attempted to add 1,200 servers to our web tier between 7:01am PST and 7:15am PST,” said Slack. “Unfortunately, our scale-up did not work as intended.” In fact it made things worse.

Attempting to provision a large number of instances while the network was failing meant resource limits were hit, including a Linux open-files limit and an AWS quota, and many of the new instances came up broken. Manual efforts to fix the provisioning eventually succeeded and Slack began to recover.
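Both kinds of ceiling are easy to inspect. The hedged sketch below reads the Linux per-process open-files limit and queries one EC2 quota through the Service Quotas API; the quota code shown is the one generally used for Running On-Demand Standard instances, and should be checked against your own account.

# Sketch: inspect the two sorts of limit mentioned above.
import resource

import boto3

# Linux per-process open-files limit (soft, hard) for the current process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft} hard={hard}")

# Applied EC2 quota via the Service Quotas API. The quota code is an
# assumption for this example; verify it for your account and region.
quotas = boto3.client("service-quotas", region_name="us-east-1")
quota = quotas.get_service_quota(
    ServiceCode="ec2",
    QuotaCode="L-1216C47A",   # Running On-Demand Standard instances (vCPUs)
)
print(quota["Quota"]["QuotaName"], quota["Quota"]["Value"])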

It followed AWS’s guidance…

The core problem was in the network. Slack runs largely on AWS – the architecture is the subject of an AWS case study, though now out of date – and it was AWS engineers who uncovered the problem with the TGWs. Slack had followed the practice AWS recommends, which is to organise workloads into separate AWS accounts.

There are multiple virtual private clouds (VPCs), and the TGWs act as hubs linking them. A TGW is meant to scale automatically: “AWS manages high availability and scalability,” say the docs.
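To make the hub-and-spoke topology concrete, the illustrative snippet below lists the VPC attachments hanging off each Transit Gateway in an account. It only reads existing resources; nothing about it is specific to Slack’s setup.

# Sketch: list each Transit Gateway and the VPCs attached to it.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

for tgw in ec2.describe_transit_gateways()["TransitGateways"]:
    tgw_id = tgw["TransitGatewayId"]
    attachments = ec2.describe_transit_gateway_attachments(
        Filters=[
            {"Name": "transit-gateway-id", "Values": [tgw_id]},
            {"Name": "resource-type", "Values": ["vpc"]},
        ]
    )["TransitGatewayAttachments"]
    spokes = [a["ResourceId"] for a in attachments]
    print(f"{tgw_id} ({tgw['State']}): {len(spokes)} VPC spokes -> {spokes}")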


In this case “our TGWs did not scale fast enough,” said Slack. AWS’s internal monitoring did fire an alert, and its engineers stepped in to scale the TGW capacity manually. “AWS assures us that they are reviewing the TGW scaling algorithms for large packet-per-second increases as part of their post-incident process,” said Slack.
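Teams wanting their own early warning could watch the published AWS/TransitGateway CloudWatch metrics for the same signal. The sketch below sums PacketsIn in one-minute buckets and flags a surge; the gateway ID and threshold are placeholders, not figures from the postmortem.

# Sketch: look for packet-per-second spikes on a Transit Gateway.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/TransitGateway",
    MetricName="PacketsIn",
    Dimensions=[{"Name": "TransitGateway", "Value": "tgw-0123456789abcdef0"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=60,                  # one-minute buckets
    Statistics=["Sum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    pps = point["Sum"] / 60     # average packets per second in the bucket
    if pps > 1_000_000:         # arbitrary example threshold
        print(f"{point['Timestamp']:%H:%M} spike: {pps:,.0f} packets/sec")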

A twist here was that Slack’s monitoring and alerting systems followed the AWS guidance by sitting in their own VPC, which left them dependent on the very TGWs that failed. Slack will now move its monitoring systems into the same VPC as the databases they query. The company said it would also “re-evaluate our health-checking and autoscaling configurations.”

Much of this tale has a familiar ring. Automatic system repair services work well for minor problems but can make matters worse when there is a widespread issue.

The core problem looks to have been with AWS itself, though the incident also exposed weaknesses in Slack’s own systems, and it showed that the AWS best practice of separating workloads into different VPCs has downsides. That does not make the advice wrong; rather, it is a matter of balancing different risks and settling on the best compromise.

The entire outage was fixed within five hours. ®