On Monday 16th December at around 14:25 GMT, customers began reporting that their monitoring showed elevated errors on connections to the Flagsmith API.
Our investigation suggested that there were no application-level issues in the Flagsmith platform. Since all customer reports were coming from the eastern US, we moved traffic away from that region, redirecting it to our US west region. By this time, we were also able to carry out our own testing using infrastructure set up in US east. Both our testing and feedback from our customers showed that moving the traffic away did not resolve the issue.
At 19:15 GMT, we opened a ticket with our infrastructure partner. Their investigations also confirmed that there were no issues with the Flagsmith platform itself. At 21:53 GMT, we created a ticket directly with our infrastructure provider (AWS) for them to investigate.
At around 23:30 GMT on 16th December, we implemented a workaround by creating a new DNS record that points directly to our infrastructure in US east, bypassing the latency-based routing provided by AWS Global Accelerator. We shared this record with the customers experiencing the issue, and they confirmed that it resolved the problem.
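For readers who want to run a similar comparison themselves, the following is a minimal Python sketch of how the failure rate on the accelerated path could be compared with the direct path. It is an illustration under stated assumptions, not the tooling used during the incident: the hostnames are placeholders, and the function names `failure_rate` and `compare` are hypothetical.

```python
import socket
from typing import Callable, Dict, Iterable

# Placeholder hostnames for illustration only; these are NOT real Flagsmith records.
ACCELERATED_HOST = "api.example.com"        # assumed: routed via the accelerator
DIRECT_HOST = "us-east-direct.example.com"  # assumed: workaround record, bypasses it


def failure_rate(host: str, port: int = 443, attempts: int = 20,
                 timeout: float = 3.0,
                 connect: Callable = socket.create_connection) -> float:
    """Return the fraction of TCP connection attempts to host:port that fail."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    failures = 0
    for _ in range(attempts):
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError:
            # Covers timeouts, refused connections, and DNS resolution errors.
            failures += 1
        else:
            conn.close()
    return failures / attempts


def compare(hosts: Iterable[str], **kwargs) -> Dict[str, float]:
    """Measure the connection failure rate for each host in turn."""
    return {host: failure_rate(host, **kwargs) for host in hosts}


if __name__ == "__main__":
    for host, rate in compare([ACCELERATED_HOST, DIRECT_HOST]).items():
        print(f"{host}: {rate:.0%} of connection attempts failed")
```

Because the failures were intermittent, a probe like this would need to run from a network location close to the affected edge location, and with enough attempts, to show a meaningful difference between the two paths.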
From here, we continued to investigate the issue with AWS support, providing them with additional information based on our testing and on reports from our customers.
On 18th December at 17:58, we received the following information from AWS, confirming that there had been an issue with Global Accelerator in the US East region:
The team confirmed that between December 13 5:00 PM PST and December 17 2:50 PM PST, AWS Global Accelerator experienced intermittent connection failures for client traffic served by the Ashburn, Virginia edge location. The issue has been resolved and the service is operating normally.
Following this, we were able to confirm that the issue was no longer reproducible for us, or for our customers.
We have requested, and are currently waiting for, a full post-mortem from the AWS team, which may affect our next steps. In the meantime, we have begun looking at alternatives to Global Accelerator that we may be able to keep as a cold standby in case of similar issues in the future.