Performance Degradation
Incident Report for Commerce Layer
Postmortem

Summary

On November 7th, between 4:33 PM and 5:00 PM UTC, we observed elevated rates of HTTP 503 errors on API and Dashboard calls. The issue was caused by an internal network condition. Our monitoring system detected the event and promptly sent alerts to our internal communication channels. To resolve the issue, we restarted all backend application servers.

Leadup

On November 07, 2023, at approximately 4:33 PM UTC, an unusual condition arose in our internal network. No changes had been made to our infrastructure in the hours leading up to the incident.

Fault

During the incident, around 16% of all API and Dashboard calls resulted in errors. The errors were spread across all endpoints and customers, roughly in proportion to the distribution and type of their traffic.

Detection

Our real-time monitoring system detected the first errors at approximately 4:33 PM UTC and promptly sent alerts to our internal communication channels. Within a minute, our system administrators began investigating the issue, and by 4:40 PM UTC the Incident Response Team was actively working to mitigate the problem.

Root Cause

We have completed our investigation of the cloud infrastructure and identified an inconsistency between the timeout settings of our internet-facing load balancers and those of the internal network proxies. Because of this mismatch, the internal proxies can close connections that the public load balancers still consider open; the load balancers then continue to forward incoming requests over connections that have already been torn down, and those requests fail.
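
As an illustration of this class of mismatch only (the values and the Go stand-ins below are hypothetical; our actual components are managed load balancers and proxies, not Go processes), the general rule is that the tier behind should keep idle connections open longer than the tier in front is allowed to reuse them, so the front tier always retires a connection before the back tier closes it:

    package main

    import (
        "log"
        "net/http"
        "time"
    )

    // Hypothetical values, for illustration only; the real settings live in
    // the load balancer and proxy configurations, not in application code.
    const (
        lbIdleTimeout    = 60 * time.Second // public load balancer: how long it reuses an idle upstream connection
        proxyIdleTimeout = 75 * time.Second // internal proxy: how long it keeps an idle connection open
    )

    func main() {
        // Safe relationship: proxyIdleTimeout > lbIdleTimeout, so the load
        // balancer always drops an idle connection before the proxy closes it
        // and never forwards a request over a connection that is already gone.
        if proxyIdleTimeout <= lbIdleTimeout {
            log.Fatalf("inconsistent timeouts: proxy %v <= load balancer %v",
                proxyIdleTimeout, lbIdleTimeout)
        }

        // Stand-in for the internal proxy: a Go server exposes the same kind
        // of knob via http.Server.IdleTimeout.
        proxy := &http.Server{
            Addr:        ":8080",
            Handler:     http.NotFoundHandler(),
            IdleTimeout: proxyIdleTimeout,
        }
        log.Fatal(proxy.ListenAndServe())
    }

With the inequality reversed, the front tier can pick an idle connection for reuse at the same moment the back tier is closing it, which is the failure mode described above.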

Mitigation and resolution

Between 4:40 PM and 4:55 PM UTC, the Incident Response Team took several remediation actions along the network pipeline, which partially mitigated the problem. It soon became clear that a more aggressive approach was needed, so the team restarted the runtime environment, including all backend application servers. By 5:00 PM UTC, API and Dashboard calls were no longer returning errors, and the team considered the incident resolved.

Corrective and Preventative Measures

To reduce the likelihood of this issue recurring under similar circumstances, we have adjusted the timeout settings identified during root cause analysis so that the load balancer and internal proxy configurations are now consistent.

Posted Nov 10, 2023 - 18:16 CET

Resolved
Performance has been restored to the normal range. We are continuing to monitor the situation.
Posted Nov 07, 2023 - 18:00 CET
Investigating
We are currently experiencing partial performance degradation on our API services and Dashboard, starting at around 5:33 PM CET. Our engineers are investigating.
Posted Nov 07, 2023 - 17:33 CET
This incident affected: Commerce API and Dashboard.