Major Core API outage
Incident Report for Flagsmith
Postmortem

At 12:46 UTC on Thursday 18th August, our monitoring picked up an increased number of HTTP 502s being served by our API. Upon investigation, it became evident that an unexpected increase in load on the PostgreSQL database that serves our Core API was causing our application to struggle to serve some requests, and we saw increased latency on those that were served.

In an attempt to resolve the issue, we adjusted the settings in our ECS cluster to reduce the number of connections to the database. Unfortunately, making this change via our IaC workflow meant that the ECS service tried to recreate all of its tasks but couldn't do so, as the health checks were unable to consistently report a healthy status. As a result, our Core API was essentially flapping up and down while it tried to reinstate all of the tasks. During this period, our API continued to serve some requests, albeit with increased latency, but a large proportion of requests still received HTTP 502s.

Following the above, our engineering team looked into the requests that were causing the increased load. From our investigation, it was apparent that the increased load was entirely on our environment document endpoint (which powers local evaluation in our latest server-side clients). This endpoint, although available in our Core API, is very database-intensive: it generates the whole environment document from our PostgreSQL database and returns it to the client as JSON, which involves a large number of queries.
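For context, this is roughly what a server-side client running in local evaluation mode looks like, and why each such client polls the environment document endpoint on every refresh interval. This is an illustrative sketch rather than a copy-paste example; the option names are taken from our Node SDK documentation and may vary between SDK versions, and the key and feature name are placeholders.

```typescript
// Rough sketch of a Node client using local evaluation (option names are
// indicative of the flagsmith-nodejs SDK, not authoritative for every version).
import Flagsmith from "flagsmith-nodejs";

const flagsmith = new Flagsmith({
  environmentKey: "ser.your-server-side-key", // server-side key required for local evaluation
  enableLocalEvaluation: true,                // download the environment document and evaluate flags in-process
  environmentRefreshIntervalSeconds: 60,      // how often the environment document is re-fetched
  requestTimeoutSeconds: 3,                   // the option involved in the bug described below
});

async function main() {
  // On each refresh interval the SDK polls the environment document endpoint;
  // flag evaluation itself then happens locally with no further API calls.
  const flags = await flagsmith.getEnvironmentFlags();
  console.log(flags.isFeatureEnabled("my_feature"));
}

main();
```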

The compounding factor was a bug in our Node client regarding request timeouts. The Node client takes a requestTimeoutSeconds argument on instantiation; however, it passes this value directly into the node-fetch library's fetch function, which expects the timeout in milliseconds. As such, if requestTimeoutSeconds was set to e.g. 3, the request would time out after 3ms and retry (3 times by default). So, every time a Node client polled for the environment, it would make 3 requests in ~9ms (or as close to that as Node can manage).
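To make the unit mismatch concrete, here is a minimal sketch of the faulty call alongside the fix. The wrapper function is hypothetical and not the actual flagsmith-nodejs source; the pieces taken from the incident are requestTimeoutSeconds and the millisecond-based timeout option supported by node-fetch v2.

```typescript
// Minimal sketch of the timeout bug (hypothetical wrapper, not the real client code).
import fetch, { Response } from "node-fetch";

async function getEnvironmentDocument(
  url: string,
  requestTimeoutSeconds: number
): Promise<Response> {
  // Buggy: node-fetch's `timeout` option is expressed in milliseconds, so
  // requestTimeoutSeconds = 3 aborts the request after 3ms and triggers the
  // retry logic almost immediately.
  // return fetch(url, { timeout: requestTimeoutSeconds });

  // Fixed: convert seconds to milliseconds before handing the value to fetch.
  return fetch(url, { timeout: requestTimeoutSeconds * 1000 });
}
```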

We were able to block traffic to this endpoint for the customer who, due to their configuration and the Node client bug above, was putting an unusual amount of load through it. Once we had blocked this traffic, the application began serving requests normally again. This occurred at 15:24 UTC, at which point traffic to the Core API was back to normal and all requests were being served successfully.

To remediate this issue, we are stepping up our efforts to encourage all of our clients to move over to our Edge API, which is immune to issues of this nature. We are also planning to make improvements to the existing Core API platform to help guard against these issues in the future:

  1. The addition of caching to our environment document endpoint to improve performance and minimise database impact (a sketch of the idea follows this list)
  2. The implementation of automated rate limiting to better protect the platform from issues of this nature
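As a rough illustration of the first item, the idea is to serve a cached copy of the environment document for a short TTL rather than rebuilding it from the database on every request. This is a minimal sketch assuming a simple in-memory TTL cache; the names, TTL, and cache strategy are assumptions for illustration, not our actual implementation.

```typescript
// Sketch of TTL caching for the environment document (illustrative only).
type EnvironmentDocument = Record<string, unknown>;

const cache = new Map<string, { document: EnvironmentDocument; expiresAt: number }>();
const TTL_MS = 10_000; // e.g. serve a cached document for up to 10 seconds

async function getEnvironmentDocument(
  environmentKey: string,
  buildFromDatabase: (key: string) => Promise<EnvironmentDocument>
): Promise<EnvironmentDocument> {
  const cached = cache.get(environmentKey);
  if (cached && cached.expiresAt > Date.now()) {
    return cached.document; // served without touching the database
  }
  const document = await buildFromDatabase(environmentKey);
  cache.set(environmentKey, { document, expiresAt: Date.now() + TTL_MS });
  return document;
}
```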

If you’ve read this and are unsure how to migrate to our Edge API, you can find out everything you need to know here.

Posted Aug 23, 2022 - 15:53 UTC

Resolved
Major outage of our Core API, affecting flag retrieval for users who have not yet migrated to Edge, as well as dashboard usage.
Posted Aug 18, 2022 - 13:00 UTC