Some users experienced HTTP 500 errors in response to Commence Layer API calls on January 17 between 4:02PM and 4:30PM UTC. The issue was triggered by a maintenance activity on one of our databases. Our monitoring tool detected the event and alerts were sent promptly to our internal communication system. We fixed the issue by closing stalled DB connections and restarting the affected API backend processes.
A maintenance activity was performed on one of our databases at around 4:01PM UTC on January 17, 2023. It completed without errors within a few seconds, but shortly thereafter HTTP 500 errors began to appear in the API backend logs and alerts were raised on our internal communication system.
A number of API calls returned errors during this window. Upon investigation, we found that the errors were caused by a backend runtime environment that was unable to connect to the database through its connection pool.
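The mechanism can be illustrated with a small, purely hypothetical sketch (it does not reflect our actual backend code): a pool that never re-validates idle connections keeps handing out a connection that the maintenance activity has already severed, and every request that uses it fails with an HTTP 500.

```python
# Hypothetical illustration of the failure mode: a naive pool reuses a
# connection that was severed by database maintenance without checking
# whether it is still alive, so the API request fails with a 500.

class DeadConnectionError(Exception):
    """Raised when a query is attempted on a severed connection."""

class Connection:
    def __init__(self):
        self.alive = True              # later severed by the maintenance task
    def execute(self, query):
        if not self.alive:
            raise DeadConnectionError("connection closed by server")
        return "ok"

class NaivePool:
    """Reuses idle connections without validating them before checkout."""
    def __init__(self, size=5):
        self._idle = [Connection() for _ in range(size)]
    def checkout(self):
        return self._idle[0]           # no liveness check before handing it out

def handle_api_request(pool):
    try:
        return 200, pool.checkout().execute("SELECT 1")
    except DeadConnectionError:
        return 500, "internal server error"   # what callers saw

pool = NaivePool()
for conn in pool._idle:
    conn.alive = False                 # maintenance silently drops all sessions
print(handle_api_request(pool))        # -> (500, 'internal server error')
```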
Our real-time monitoring system detected the first error at 4:01PM UTC and immediately sent alerts to our internal communication channels. By 4:02PM UTC, our System Administrators were already investigating the issue.
We have completed the investigation with our cloud service provider. We identified an inconsistency in the DB connection pooling service that can occur during certain backup tasks. We now have in place a different way of connecting to the DB that will avoid such problems during similar operations in the future.
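The report does not detail the new connection method. As an illustration only, one common way to make a pool resilient to connections dropped during maintenance or backups is to validate and recycle them, shown here with SQLAlchemy (not necessarily the library used in our production stack; the DSN and numbers are placeholders):

```python
# Hypothetical sketch: a connection pool that pings each connection before
# use and proactively recycles old ones, so sessions broken by maintenance
# are discarded and replaced instead of being handed to request handlers.
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql://api_user:***@db-host/app",  # placeholder DSN
    pool_pre_ping=True,    # validate each connection before handing it out
    pool_recycle=1800,     # replace connections older than 30 minutes
    pool_size=10,
    max_overflow=5,
)

# A connection severed by a backup or maintenance task fails the pre-ping,
# is discarded, and is transparently replaced with a fresh one.
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))
```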
Between 4:05PM UTC and 4:20PM UTC the team attempted several restarts of the affected runtime environment, which only partially mitigated the issue. They then took a more aggressive approach: a cold restart of the runtime environment combined with killing all database processes. By 4:30PM UTC API calls had stopped returning errors, and after another 15 minutes the team deemed the incident resolved.
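For context, the kind of cleanup performed during this mitigation is typically done by terminating client sessions that have been idle since before the maintenance window, which forces the pool to reconnect. The following is a hypothetical sketch only; it assumes a PostgreSQL database, which this report does not specify, and uses a placeholder DSN:

```python
# Hypothetical sketch: terminate idle client connections that predate the
# maintenance window so the application pool is forced to open fresh ones.
import psycopg2

MAINTENANCE_START = "2023-01-17 16:01:00+00"

conn = psycopg2.connect("dbname=app user=admin host=db-host")  # placeholder DSN
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT pg_terminate_backend(pid)
        FROM pg_stat_activity
        WHERE state = 'idle'
          AND backend_start < %s
          AND pid <> pg_backend_pid();
        """,
        (MAINTENANCE_START,),
    )
    print(f"terminated {cur.rowcount} stalled connections")
conn.close()
```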
During the post-mortem analysis we found and tested a procedure that allows us to repeat the same DB activity without any errors.