Small number of intermittent requests to Edge API failing
Incident Report for Flagsmith
Postmortem

What Happened

On Monday 16th December at around 14:25 GMT, we started receiving reports from customers of elevated errors in their monitoring for connections to the Flagsmith API.

Our investigation suggested that there were no application-level issues in the Flagsmith platform. Since all customer reports were coming from the eastern US, we moved traffic away from our US East region, redirecting it to our US West region. By this time, we were also able to carry out our own testing using infrastructure set up in US East. Both our testing and feedback from our customers showed that moving the traffic away did not resolve the issue.

At 19:15 GMT, we opened a ticket with our infrastructure partner. Their investigations also confirmed that there were no issues with the Flagsmith platform itself. At 21:53 GMT, we created a ticket directly with our infrastructure provider (AWS) for them to investigate.

At around 23:30 GMT on 16th December, we implemented a workaround by creating a new DNS record that points directly to our infrastructure in US East, bypassing the latency-based routing provided by AWS Global Accelerator. We shared this record with customers experiencing the issue, and they confirmed that it resolved the problem.
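For clients that want to apply this kind of workaround automatically, the same idea can be expressed as an ordered list of endpoints that are tried in turn. The following is a minimal sketch, not part of any Flagsmith SDK: the hostnames, the function names and the timeout value are all hypothetical, and the direct record shared with affected customers is not reproduced here.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional, Sequence

# Hypothetical hostnames for illustration only; they are NOT the real records.
PRIMARY = "https://api.example.com"              # routed via a global accelerator
DIRECT_US_EAST = "https://us-east.example.com"   # points straight at one region


def http_get(url: str, timeout: float) -> bytes:
    """Fetch a URL with a hard timeout and return the raw response body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def get_with_fallback(
    path: str,
    base_urls: Sequence[str],
    fetch: Callable[[str, float], bytes] = http_get,
    timeout: float = 3.0,
) -> bytes:
    """Try each base URL in order and return the first successful response.

    Timeouts and refused or reset connections move on to the next endpoint.
    Note that urllib's HTTPError is a subclass of URLError, so HTTP error
    statuses also trigger the fallback in this sketch.
    """
    last_error: Optional[Exception] = None
    for base in base_urls:
        try:
            return fetch(base + path, timeout)
        except (urllib.error.URLError, TimeoutError, ConnectionError) as exc:
            last_error = exc
    raise RuntimeError(f"all endpoints failed for {path}") from last_error
```

A caller would pass the endpoints in order of preference, for example get_with_fallback("/health", [PRIMARY, DIRECT_US_EAST]), where "/health" is also a placeholder path. Whether an automatic fallback is appropriate depends on the client; during this incident the direct record was enabled manually by the affected customers.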

From here, we continued to investigate the issue with AWS support, providing them with additional information based on our testing and on reports from our customers.

On 18th December at 17:58, we received the following information from AWS, confirming that there had been an issue with Global Accelerator in the US East region:

The team confirmed that between December 13 5:00 PM PST and December 17 2:50 PM PST, AWS Global Accelerator experienced intermittent connection failures for client traffic served by the Ashburn, Virginia edge location. The issue has been resolved and the service is operating normally.

Following this, we were able to confirm that the issue was no longer reproducible for us, or for our customers.
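Because the window reported by AWS is given in PST while the rest of this report uses GMT/UTC, it can help to convert it explicitly. The sketch below assumes the year 2024 (taken from the timestamps of the status updates) and uses the fixed PST offset of UTC-8; the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# PST is a fixed UTC-8 offset (no daylight saving applies in December).
PST = timezone(timedelta(hours=-8), "PST")

# Window reported by AWS; the year 2024 is an assumption based on the
# dates of the status updates.
start_pst = datetime(2024, 12, 13, 17, 0, tzinfo=PST)   # December 13, 5:00 PM PST
end_pst = datetime(2024, 12, 17, 14, 50, tzinfo=PST)    # December 17, 2:50 PM PST

start_utc = start_pst.astimezone(timezone.utc)
end_utc = end_pst.astimezone(timezone.utc)
duration = end_utc - start_utc

print(start_utc.isoformat())  # 2024-12-14T01:00:00+00:00
print(end_utc.isoformat())    # 2024-12-17T22:50:00+00:00
print(duration)               # 3 days, 21:50:00
```

In UTC the window runs from December 14, 01:00 to December 17, 22:50, which is just under four days and began before the first customer reports on 16th December.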

What’s Next?

We have requested, and are currently waiting for, a full post-mortem from the AWS team, which may affect our next steps. In the meantime, we have begun looking at alternatives to Global Accelerator that we could keep as a cold standby in case of similar issues in the future.

Posted Dec 20, 2024 - 17:20 UTC

Resolved
We have now received confirmation from multiple customers in the US East region that the issue has been resolved.
Posted Dec 19, 2024 - 14:04 UTC
Monitoring
Our infrastructure provider has confirmed that the affected service was intermittently unavailable between December 14, 01:00 UTC and December 17, 22:50 UTC. We are working with affected customers to confirm that this issue is resolved.
Posted Dec 18, 2024 - 18:56 UTC
Identified
The issue has been isolated to a specific component in our hosting provider's network infrastructure. We are working with them, and escalating to get the issue resolved.
Posted Dec 17, 2024 - 15:49 UTC
Update
We have passed on all of the troubleshooting information to our infrastructure provider and are awaiting further information from them on this issue.

If you are affected by this issue, please get in touch with us at support@flagsmith.com with any information that you can share so we can identify this intermittent issue.
Posted Dec 16, 2024 - 23:43 UTC
Investigating
We have had further reports of similar issues following the migration of traffic away from us-east-2, and we are continuing to investigate.
Posted Dec 16, 2024 - 21:59 UTC
Monitoring
We have migrated all traffic away from us-east-2 and are monitoring the impact.
Posted Dec 16, 2024 - 20:17 UTC
Update
The issue seems to only be affecting clients connecting to the us-east-2 region. We're currently redirecting traffic away from the region.
Posted Dec 16, 2024 - 19:10 UTC
Investigating
We are currently investigating the Edge API sporadically timing out some requests or refusing connections.
Posted Dec 16, 2024 - 19:07 UTC
This incident affected: Edge API.