Core API Outage
Incident Report for Flagsmith
Resolved
Our Core API was overwhelmed by a massive traffic spike, which caused the core SQL database to become extremely slow. As a result, ECS tasks failed their health checks, prompting the load balancer to stop tasks and start new ones, which in turn added more load to the already maxed-out database.

We tried several approaches to rate-limiting the source of the traffic, but eventually had to stop all traffic at the load balancer for 2 minutes in order to stabilise the system.

We are working on implementing AWS API Gateway to add rate limiting at the gateway level and avoid this sort of incident in the future.
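
For illustration only, gateway-level rate limiting is often built on a token bucket: each client has a bucket that refills at a steady rate up to a burst capacity, and a request is rejected (typically with HTTP 429) when the bucket is empty. The sketch below is a minimal, hypothetical example of that idea, not the actual gateway configuration. The class name `TokenBucket` and the `rate`, `burst` and `clock` parameters are assumptions made for the example.

```python
import time


class TokenBucket:
    """Illustrative token-bucket limiter (hypothetical; not production code)."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)      # tokens added per second (steady-state limit)
        self.burst = float(burst)    # maximum tokens the bucket can hold
        self.clock = clock           # injectable time source, useful for testing
        self.tokens = float(burst)   # the bucket starts full
        self.updated = clock()

    def allow(self):
        """Return True if a request may proceed, False if it should be rejected."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the time that has passed, capped at the burst capacity.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With `rate=2` and `burst=3`, a client could send three requests at once, after which further requests would be rejected until tokens refill at two per second. Rejecting excess requests at the edge in this way keeps them from reaching the database at all.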
Posted Jul 04, 2024 - 08:36 UTC