Some users experienced HTTP 500 errors in response to Commence Layer API calls on January 17 between 4:02PM and 4:30PM UTC. The issue was triggered by a maintenance activity on one of our databases. Our monitoring tool detected the event and alerts were sent promptly to our internal communication system. We fixed the issue by closing stalled DB connections and restarting the affected API backend processes.
A maintenance activity was performed on one of our databases at around 4:01PM UTC on January 17, 2023. It completed without errors within a few seconds, but shortly thereafter HTTP 500 errors began to appear in the API backend logs and alerts were raised on our internal communication system.
A number of API calls returned errors during this window. Upon investigation, we found that the errors were caused by a backend runtime environment that was unable to connect to the database through its connection pool.
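The mechanism can be illustrated with a small, purely hypothetical sketch (it does not reflect our actual backend code): a pool that never re-validates idle connections keeps handing out a connection that the maintenance activity has already severed, and every request that uses it fails with an HTTP 500.

```python
# Hypothetical illustration of the failure mode: a naive pool reuses a
# connection that was severed by database maintenance without checking
# whether it is still alive, so the API request fails with a 500.

class DeadConnectionError(Exception):
    """Raised when a query is attempted on a severed connection."""

class Connection:
    def __init__(self):
        self.alive = True              # later severed by the maintenance task
    def execute(self, query):
        if not self.alive:
            raise DeadConnectionError("connection closed by server")
        return "ok"

class NaivePool:
    """Reuses idle connections without validating them before checkout."""
    def __init__(self, size=5):
        self._idle = [Connection() for _ in range(size)]
    def checkout(self):
        return self._idle[0]           # no liveness check before handing it out

def handle_api_request(pool):
    try:
        return 200, pool.checkout().execute("SELECT 1")
    except DeadConnectionError:
        return 500, "internal server error"   # what callers saw

pool = NaivePool()
for conn in pool._idle:
    conn.alive = False                 # maintenance silently drops all sessions
print(handle_api_request(pool))        # -> (500, 'internal server error')
```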
Our real-time monitoring system detected the first error at 4:01PM UTC and immediately sent alerts to our internal communication channels. By 4:02PM UTC, our System Administrators were already investigating the issue.
We have completed the investigation with our cloud service provider. We identified an inconsistency in the DB connection pooling service that can occur during certain backup tasks. We now have in place a different way of connecting to the DB that will avoid such problems during similar operations in the future.
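The report does not detail the new connection method. As an illustration only, one common way to make a pool resilient to connections dropped during maintenance or backups is to validate and recycle them, shown here with SQLAlchemy (not necessarily the library used in our production stack; the DSN and numbers are placeholders):

```python
# Hypothetical sketch: a connection pool that pings each connection before
# use and proactively recycles old ones, so sessions broken by maintenance
# are discarded and replaced instead of being handed to request handlers.
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql://api_user:***@db-host/app",  # placeholder DSN
    pool_pre_ping=True,    # validate each connection before handing it out
    pool_recycle=1800,     # replace connections older than 30 minutes
    pool_size=10,
    max_overflow=5,
)

# A connection severed by a backup or maintenance task fails the pre-ping,
# is discarded, and is transparently replaced with a fresh one.
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))
```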
Between 4:05PM UTC and 4:20PM UTC the team attempted several restarts of the affected runtime environment, which only partially mitigated the issue. They then took a more aggressive approach: a cold restart of the runtime environment combined with killing all database processes. By 4:30PM UTC API calls had stopped returning errors, and after another 15 minutes the team deemed the incident resolved.
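For context, the kind of cleanup performed during this mitigation is typically done by terminating client sessions that have been idle since before the maintenance window, which forces the pool to reconnect. The following is a hypothetical sketch only; it assumes a PostgreSQL database, which this report does not specify, and uses a placeholder DSN:

```python
# Hypothetical sketch: terminate idle client connections that predate the
# maintenance window so the application pool is forced to open fresh ones.
import psycopg2

MAINTENANCE_START = "2023-01-17 16:01:00+00"

conn = psycopg2.connect("dbname=app user=admin host=db-host")  # placeholder DSN
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT pg_terminate_backend(pid)
        FROM pg_stat_activity
        WHERE state = 'idle'
          AND backend_start < %s
          AND pid <> pg_backend_pid();
        """,
        (MAINTENANCE_START,),
    )
    print(f"terminated {cur.rowcount} stalled connections")
conn.close()
```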
During the post-mortem analysis we found and tested a procedure that allows us to repeat the same DB activity without any errors.