Environment based integrations not working for Edge API
Incident Report for Flagsmith
Postmortem

On June 7th at 10:33 UTC we released a change to our Edge API as part of this issue that filters out server-side only features when a client API key is used. These changes also affected the logic responsible for triggering environment-level integrations, causing them to fail.

Which integrations were affected?

  • Mixpanel
  • Segment
  • Heap
  • Webhooks
  • Amplitude

On June 15th at 22:25 UTC we were notified by a customer that they were not seeing data populated from their integration. First thing on June 16th the engineering team began investigating the issue. The team immediately identified that the change described above had changed the signature of a function that was also used by the integrations logic. Unfortunately, this change had not been picked up by our tests or code review, and the subsequent errors were not picked up by our monitoring.

Why didn’t our tests pick this up?

The unit tests covering the integrations logic utilised mocking, meaning that the change to the method signature was not correctly identified as an issue and our end to end test suite did not include the verification of successful integrations.

Why didn’t our monitoring pick this up?

The monitoring in place to track the error rate on the function responsible was using an incorrect aggregation algorithm meaning that the threshold for alert was never breached.

At 18:03 UTC on June 16th a fix was released and service was resumed to all environment-based identity integrations (listed above). Unfortunately, due to the implementation of the asynchronous call to the lambda function that handles the integrations, it is not possible to recover the data that was lost during this period.

What are we doing to prevent this from happening in the future

  • Improving our unit tests to rely less on mocks and, where they do rely on mocks, ensuring they utilise spec correctly (see unittest documentation here for further reading)
  • Extending our E2E testing suite on the Edge API to include tests for all integrations.
  • Alerting & monitoring:

    • Immediately this morning, we are changing our alerting to use the correct aggregation algorithm
    • In the near future, we will be improving these alerts to use percentiles and anomaly detection to ensure that errors are picked up quicker and are more accurate
    • Introducing Sentry to better track error logs that are reported from the Edge API
  • Moving asynchronous invocations of other lambda functions to use persistent queues and/or an event messaging system so that, after issues such as this, tasks can be re-run to ensure no data is lost.

Posted Jun 21, 2023 - 10:26 UTC

Resolved
Issue has been resolved. Integrations all tested as working. Post-Mortem report to come next week.
Posted Jun 16, 2023 - 18:13 UTC
Identified
We are aware of issues firing environment based integrations in our Edge API. We have identified a fix and are working on the testing now. Performance of the Edge API itself is not affected.
Posted Jun 16, 2023 - 17:40 UTC
This incident affected: Edge API.