On April 9th, 2024, from 6:45 AM to 8:55 AM UTC, during a scheduled maintenance window, we encountered a database issue. The database version upgrade process failed, and the automatic recovery provided by our cloud infrastructure provider did not restore the database as expected. We immediately contacted the infrastructure provider’s support team, who resolved the problem; no data was lost.
We began the database upgrade maintenance on April 9th, 2024, at 6:39 AM UTC. At approximately 6:45 AM UTC, the background upgrade task run by the cloud database service failed, leaving the entire upgrade procedure stuck in an intermediate state.
During the incident, all Core API calls resulted in errors. The errors affected all endpoints and customers, with the impact on each customer depending on their traffic volume and request types. Read requests served from our CDN cache continued to work without errors.
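For illustration, the split between origin traffic and CDN-cached traffic can be observed with a simple probe like the sketch below; the endpoint URLs are placeholders, not our actual Core API or CDN hostnames.

```python
import urllib.request
import urllib.error

# Placeholder URLs, not our real endpoints.
ENDPOINTS = {
    "core_api": "https://api.example.com/v1/items",        # served by the origin (hits the database)
    "cdn_cached_read": "https://cdn.example.com/v1/catalog",  # served from the CDN cache
}

def probe(name, url):
    """Issue a GET and report the HTTP status, so the two paths can be compared."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name}: HTTP {resp.status}")
    except urllib.error.HTTPError as exc:
        print(f"{name}: HTTP {exc.code}")
    except urllib.error.URLError as exc:
        print(f"{name}: unreachable ({exc.reason})")

for name, url in ENDPOINTS.items():
    probe(name, url)
```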
A few minutes after beginning the maintenance, our engineers realized that the automatic upgrade process was not responding. The cloud service provides a safeguard for upgrades that automatically restores normal operations if the procedure does not complete within a 5-minute timeout. From our side, this expected recovery did not take effect.
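For context, the sketch below illustrates the kind of timeout-based safeguard described above. It is a minimal approximation, assuming hypothetical status and rollback hooks; the real mechanism is internal to the cloud provider and not exposed to us.

```python
import time

# The provider documents a 5-minute threshold for upgrade recovery.
UPGRADE_TIMEOUT_SECONDS = 5 * 60
POLL_INTERVAL_SECONDS = 15

def watchdog(get_status, rollback, timeout=UPGRADE_TIMEOUT_SECONDS,
             poll_interval=POLL_INTERVAL_SECONDS):
    """Poll the upgrade status and trigger a rollback if the upgrade fails
    or does not complete within the timeout.

    `get_status` and `rollback` stand in for the provider's internal
    mechanisms, which we do not have access to.
    """
    started = time.monotonic()
    while True:
        status = get_status()
        if status == "completed":
            return "upgrade finished normally"
        if status == "failed" or time.monotonic() - started > timeout:
            rollback()
            return "rollback triggered"
        time.sleep(poll_interval)

if __name__ == "__main__":
    # Simulated run: the upgrade reports "running" forever, so the watchdog
    # rolls back once the (shortened) timeout elapses.
    result = watchdog(get_status=lambda: "running",
                      rollback=lambda: print("restoring previous version"),
                      timeout=2, poll_interval=0.5)
    print(result)
```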
The service provider's support team confirmed that the upgrade process, once initiated, encountered an unexpected issue within their internal procedure, causing the upgrade to stall.
According to the provider, the automatic recovery safeguard did begin five minutes after the failure, but it could not complete because one of the read replicas was in an inconsistent state. The vendor is still investigating the root cause of this issue.
Since the database service is fully managed by the provider, our engineers had to wait for the provider's support team to restore normal availability.
A few minutes before 7:00 AM UTC, our engineers tried to interrupt the process manually, but the cancel option had been disabled by the vendor. Our team immediately engaged the service provider’s support team, requesting with the highest urgency that the process be interrupted and database availability restored.
The provider’s support team began investigating with their internal tools and, after a few minutes, suggested a couple of workarounds, neither of which succeeded.
In parallel, we began our own restoration process from a backup.
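For context, a simplified sketch of what such a parallel restoration path can look like, assuming a PostgreSQL-compatible engine and a locally available logical dump; the actual engine, backup format, and target instance used during the incident are provider-specific and not shown here.

```python
import subprocess

# Placeholder values; the real host, database, and dump path are provider-specific.
BACKUP_FILE = "/backups/core_db_2024-04-09.dump"
TARGET_HOST = "standby-db.internal.example.com"
TARGET_DB = "core"

def restore_from_backup():
    """Restore the logical dump into a standby instance using pg_restore."""
    subprocess.run(
        [
            "pg_restore",
            "--host", TARGET_HOST,
            "--dbname", TARGET_DB,
            "--clean",      # drop existing objects before recreating them
            "--no-owner",   # do not try to match original object ownership
            "--jobs", "4",  # restore with four parallel workers
            BACKUP_FILE,
        ],
        check=True,
    )

if __name__ == "__main__":
    restore_from_backup()
```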
At about 8:50 AM UTC, the service provider’s support team restored database availability, and our services gradually recovered to normal operation. By approximately 8:55 AM UTC, the resolution was complete.
Given the nature of the cloud infrastructure on which our services are built, not all operational steps are fully within our control. However, we have identified improvements to our processes and procedures that we intend to implement together with our service provider: