Slow response times for Edge API requests

Incident Report for Flagsmith

Postmortem

Timeline

At 12:15pm UTC, we were notified of increased response times on a number of our Edge API endpoints. Initial investigation found no obvious cause, but we suspected Sentry, our APM tool. We removed the Sentry initialisation from our code and deployed the change as soon as we could.

At 12:48pm UTC, the change was deployed and response times decreased immediately.

At 12:52pm UTC, our monitoring confirmed that the average response time had returned to normal.

Next Steps

  • Investigate ways to reduce or remove the impact of Sentry issues on our Edge API.

    • Decrease the shutdown timeout of the Sentry SDK.
    • Evaluate Sentry Relay as a way to isolate core Edge API services from Sentry issues.

  • Add integration tests that simulate performance degradation and outages in all downstream services.
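
As an illustration of the kind of test the last item describes, here is a minimal Python sketch. It is not our Edge API code: `slow_downstream`, `handle_request` and the time budgets are hypothetical names and values chosen for the example. The idea is that a call to a downstream service (such as sending an APM event) gets a fixed time budget, and the test asserts that a degraded downstream cannot push the response time past it.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool for downstream calls (size is an arbitrary choice).
_executor = ThreadPoolExecutor(max_workers=4)


def slow_downstream(delay_s: float) -> str:
    """Hypothetical stand-in for a degraded downstream service."""
    time.sleep(delay_s)
    return "sent"


def handle_request(downstream, budget_s: float = 0.05) -> dict:
    """Serve a request, waiting at most budget_s seconds for the downstream call."""
    future = _executor.submit(downstream)
    try:
        future.result(timeout=budget_s)
        reported = True
    except FutureTimeout:
        # Give up on the downstream call rather than delay the response.
        reported = False
    return {"status": 200, "reported": reported}


def test_slow_downstream_does_not_slow_response():
    start = time.monotonic()
    response = handle_request(lambda: slow_downstream(0.5), budget_s=0.05)
    elapsed = time.monotonic() - start

    assert response["status"] == 200
    assert response["reported"] is False
    # The downstream takes 0.5s; the response must return well before that.
    assert elapsed < 0.3
```

A test runner such as pytest would pick up the `test_` function as written. In a real suite the simulated delay would be injected at the boundary of each downstream service, so the same assertion can be repeated per dependency.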
Posted Jul 10, 2023 - 13:42 UTC

Resolved

This incident has been resolved.
Posted Jul 10, 2023 - 12:58 UTC

Monitoring

The downstream service has been successfully removed. Response times have returned to normal. We are continuing to monitor the situation.
Posted Jul 10, 2023 - 12:50 UTC

Identified

We have identified an issue with a downstream service that is having a knock-on effect on our performance. We are currently deploying a change to remove the downstream service.
Posted Jul 10, 2023 - 12:44 UTC

Investigating

We are currently investigating this issue.
Posted Jul 10, 2023 - 12:29 UTC
This incident affected: Edge API.