Partial Outage
Incident Report for Commerce Layer

Summary

On March 26th, from 5:45 AM to 7:00 AM UTC, a database issue caused errors in response to a portion of API and Dashboard calls. Our monitoring tool identified the problem and immediately sent alerts to our internal communication channels. Upon investigating, we applied the needed fixes and restarted all backend application services, which restored normal operation.

Later, between 9:10 AM and 9:40 AM UTC, we experienced degraded database performance as a residual effect of the main issue. This had a lighter impact on services than the initial incident.

Timeline

On March 26th, 2024, at approximately 5:45 AM UTC, an unusual database condition occurred, despite no changes having been applied to our infrastructure in the hours preceding the incident.

Impact

During the incident, approximately 30% of all API and Dashboard calls resulted in errors. These errors were spread across all endpoints and customers, depending on their traffic distribution and type.

Detection

Our real-time monitoring system first detected an error around 5:45 AM UTC and quickly sent out alerts via our internal communication channels. System Administrators started investigating the issue shortly thereafter. By 6:30 AM UTC, the Incident Response Team was actively addressing the problem.

Root Cause

We have finished investigating our database infrastructure and codebase. We identified specific timeout parameters that, under certain concurrency conditions, could lead to the issue we experienced.

When this issue arises, application transactions may fail to obtain a database connection from the connection pool, resulting in an error. At the same time, the current auto-scaling parameters were unable to detect this specific temporary glitch, so other transactions received a timeout before the database could complete the requested operation.
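To illustrate the failure mode described above, here is a minimal, self-contained sketch of a connection pool with a checkout timeout. It is a toy model built for this report: the class, connection stand-ins, and timeout values are all hypothetical, and do not reflect our actual stack or configuration. When the pool is exhausted, a transaction waits for the configured timeout and then errors instead of reaching the database.

```python
import queue


class ConnectionPool:
    """Toy connection pool (hypothetical; for illustration only)."""

    def __init__(self, size, checkout_timeout):
        self.checkout_timeout = checkout_timeout
        self._pool = queue.Queue(maxsize=size)
        for i in range(size):
            self._pool.put(f"conn-{i}")  # stand-ins for real DB connections

    def checkout(self):
        # Block until a connection frees up, or fail after checkout_timeout.
        try:
            return self._pool.get(timeout=self.checkout_timeout)
        except queue.Empty:
            raise TimeoutError("could not acquire a connection from the pool")

    def checkin(self, conn):
        self._pool.put(conn)


pool = ConnectionPool(size=2, checkout_timeout=0.1)
held = [pool.checkout(), pool.checkout()]  # pool is now exhausted
try:
    pool.checkout()  # a third transaction times out instead of connecting
except TimeoutError as e:
    print("transaction failed:", e)
```

Under concurrent load, every transaction that hits this timeout surfaces to the caller as an API error, which matches the error pattern observed during the incident.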

Mitigation and Resolution

From 6:00 AM to 6:45 AM UTC, the Team examined the logs and monitoring systems to identify specific errors and possible immediate emergency actions. Recognizing the issue's impact, they opted for an aggressive approach, which led to a restart of the runtime environment. By around 7:00 AM UTC, all API and Dashboard calls were error-free. The Team continued to work on identifying the root cause to resolve the issue definitively, while continuing to monitor the situation, aware that side effects could still occur.

At 9:10 AM UTC, they indeed observed degraded database performance. In this case, they applied a few mitigation measures, such as horizontal scaling and a code hotfix that deployed a specific optimization.

Finally, a more suitable set of timeout parameters was applied to the database infrastructure, allowing the team to consider the issue resolved.

Corrective and Preventative Measures

To significantly lessen, or even eliminate, the likelihood of this issue recurring, we've already implemented extra controls in our codebase and set up alerts for proactive detection and automated resolution of the specific condition. We are also developing additional auto-scaling criteria to add capacity should similar conditions recur in the future, even when triggered for different reasons.
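The kind of auto-scaling criterion described above can be sketched as a simple predicate over observed metrics. The metric names and thresholds below are illustrative assumptions made for this example, not our actual configuration: the point is that scaling triggers on the symptoms (connection waits, error rates) rather than on one specific cause.

```python
# Hypothetical auto-scaling criterion; metric names and thresholds
# are illustrative assumptions, not an actual production configuration.
def should_scale_out(pool_wait_p95_ms, error_rate,
                     wait_threshold_ms=500, error_threshold=0.05):
    """Add capacity when connection-checkout waits or error rates spike,
    regardless of which underlying condition triggered them."""
    return pool_wait_p95_ms > wait_threshold_ms or error_rate > error_threshold


print(should_scale_out(pool_wait_p95_ms=800, error_rate=0.01))  # True: slow checkouts alone
print(should_scale_out(pool_wait_p95_ms=120, error_rate=0.10))  # True: error spike alone
print(should_scale_out(pool_wait_p95_ms=120, error_rate=0.01))  # False: healthy
```

Keying on symptoms means a future glitch with a different root cause would still add capacity, which is the intent of the new criteria.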

Posted Mar 28, 2024 - 18:28 CET

The issue is resolved and the incident is closed. We're working on the root cause analysis; a post-mortem report will be available soon.
Posted Mar 26, 2024 - 15:50 CET
A fix has been applied. We'll keep monitoring the situation over the next few hours.
Posted Mar 26, 2024 - 12:32 CET
We are experiencing a performance degradation of our API and Dashboard services. We are currently applying mitigation measures while we keep investigating the root cause and will provide updates here.
Posted Mar 26, 2024 - 10:41 CET
We've experienced a partial outage on our API services and Dashboard from around 6:45 AM to 8:00 AM CET. Our engineers have restored the service and are currently investigating the root cause.
Posted Mar 26, 2024 - 09:35 CET
This incident affected: Commerce API and Dashboard.