Core API is not responding
Incident Report for Flagsmith
Postmortem

Summary

On September 5th at 09:45 UTC, we initiated a release that included a database migration adding a new constraint to the table that stores flag data. According to our pre-live tests, this migration should not have taken more than 50 milliseconds. In production, however, the migration needed to acquire a temporary lock on a table with high throughput. Because of that throughput, a backlog of blocked connections built up, all waiting on the migration to complete. This had a knock-on effect that exhausted the available database connections, and a full restart was necessary.

Once the restart was complete, the connections were restored and service was resumed. This happened at 10:20 UTC.
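
To give a sense of how quickly a backlog like the one described above can exhaust a database, here is a rough back-of-the-envelope sketch. It assumes every incoming query on the locked table blocks and keeps holding its connection. The function name and the figures in the usage note are hypothetical and are not measurements from this incident.

```python
def seconds_until_exhaustion(free_connections: int,
                             blocked_queries_per_second: float) -> float:
    """Estimate how long until no connections are left.

    Assumes each arriving query blocks on the lock and holds one
    connection until the lock is released (a simplification).
    """
    if blocked_queries_per_second <= 0:
        # Nothing is blocking, so the pool never fills up in this model.
        return float("inf")
    return free_connections / blocked_queries_per_second
```

With a hypothetical 200 free connections and 100 blocked queries per second, the function returns 2.0, which illustrates why a lock that is expected to last milliseconds can still take a busy database down within seconds if it gets stuck.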

Next Steps

We have investigated the cause of the issue, and some aspects still need further research. In the meantime, our plan is to implement safeguards based on the settings described in the following Postgres documentation, which should help reduce the impact of any similar issue in the future.

https://www.postgresql.org/docs/11/runtime-config-client.html

https://www.postgresql.org/docs/11/runtime-config-logging.html (log_lock_waits)
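
As an illustration of the kind of safeguard these settings make possible, the sketch below runs a migration with a short `lock_timeout` and retries if the lock cannot be acquired in time. This is an assumption about how such a safeguard might look, not a description of what we have deployed. `lock_timeout` is a real Postgres setting from the first page above, while `run_migration_with_lock_timeout`, `LockTimeout`, and the `execute` callable are hypothetical names standing in for a real database driver.

```python
import time
from typing import Callable, Sequence


class LockTimeout(Exception):
    """Hypothetical stand-in for the driver error raised when lock_timeout expires."""


def run_migration_with_lock_timeout(
    execute: Callable[[str], None],
    statements: Sequence[str],
    lock_timeout_ms: int = 2000,
    attempts: int = 3,
    backoff_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run the statements in one transaction, giving up quickly on a contended lock.

    Returns the number of the attempt that succeeded. Re-raises
    LockTimeout if every attempt times out.
    """
    for attempt in range(1, attempts + 1):
        try:
            execute("BEGIN")
            # SET LOCAL only lasts until the end of the current transaction.
            execute(f"SET LOCAL lock_timeout = '{int(lock_timeout_ms)}ms'")
            for statement in statements:
                execute(statement)
            execute("COMMIT")
            return attempt
        except LockTimeout:
            # Release our place in the lock queue so other queries can proceed.
            execute("ROLLBACK")
            if attempt == attempts:
                raise
            sleep(backoff_seconds * attempt)
    raise ValueError("attempts must be at least 1")
```

The design choice here is that a migration which cannot get its lock within the timeout fails fast and steps aside, rather than leaving other connections queued behind it. Enabling `log_lock_waits` alongside this would record any lock waits longer than `deadlock_timeout` in the server log.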

Posted Sep 12, 2023 - 16:51 UTC

Resolved
This incident has been resolved. A postmortem will follow.
Posted Sep 05, 2023 - 12:01 UTC
Update
We are continuing to monitor for any further issues.
Posted Sep 05, 2023 - 10:25 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Sep 05, 2023 - 10:25 UTC
Identified
We have identified a database migration that has failed as part of a new release. We are working to re-apply the migration.
Posted Sep 05, 2023 - 09:51 UTC
Investigating
We are currently investigating this issue.
Posted Sep 05, 2023 - 09:45 UTC
This incident affected: Core API.