API outage

Incident Report for Flagsmith

Postmortem

Overview

On the 20th of January 2021, at 16:47 UTC, our REST API suffered a partial outage for 38 minutes, with partial service resuming over the course of 6 minutes, resulting in total downtime of 44 minutes. The core reason for the outage was a database migration that failed to apply correctly. We manually corrected the migration and service was resumed.

We’re really sorry for this downtime. We work hard to try to ensure 100% uptime, and will take on these learnings to improve the service into the future.

Background

As part of the development work around 3rd party integrations, we have been working on an integration with [Amplitude](https://amplitude.com/). This integration requires a new table to be created in the core Postgres database. Consequently a Django Database Migration was created to facilitate this.

As part of this work, one of our developers manually edited the migration to make a change to the data schema. This was an error; migrations should not be manually edited; the engineer should have created a second migration to modify the data schema.

We have also been migrating our code to use the Black python formatter. This caused issues with regards to our code review process by polluting the code review with additional formatting that made reading the code harder than it ought to be.

Testing

The code worked in our local, development and staging environments. This was due to the fact that test data was present in the prod environment but not on the development or staging environments. The migration failed to apply everywhere (because the app thought there was no migration to apply) but the exception was only thrown because there was data in the table in production.

Outage

Once our code review had progressed, we merged our code to master and the CI/CD pipelines pushed it into production. This caused the outage. We were also late to be alerted to this on account of it not taking down endpoints like /health; /health was still reporting 200 OK response codes.

Immediate Fix

We identified the issue quickly, wrote and tested a fix and then deployed it into production.

Learnings

We will ensure a 3rd set of eyes review any commits that include data migration code.
We will ensure a better consistency of data during testing.
Our downtime alerting has been improved to make synthetic API calls to core SDK endpoints like retrieving flags. This better simulates real world usage.

Posted Jan 25, 2021 - 19:41 UTC

Resolved

We've identified an issue with a database migration that failed in production as part of an upgrade. We've remedied the issue and are monitoring the platform. Will provide further updates as soon as we can.

We've identified the code that cause the outage and will be monitoring the platform but are confident that the issue has been fully resolved.

We will provide a post mortem as soon as we have completed the root cause analysis.

Posted Jan 20, 2021 - 17:30 UTC

This incident affected: Core API.