Partial Outage
Incident Report for Commerce Layer

Summary

On March 26th, from 5:45 AM to 7:00 AM UTC, a database issue caused errors in response to a portion of API and Dashboard calls. Our monitoring tool identified the problem and immediately sent alerts to our internal communication channels. Upon investigating, we applied the needed fixes and restarted all backend application services, which restored normal operation.

Later, between 9:10 AM and 9:40 AM UTC, we experienced degraded database performance as a residual effect of the main issue. This had a lighter impact on services than the initial incident.

Timeline

On March 26th, 2024, at approximately 5:45 AM UTC, an unusual database condition occurred, despite no changes having been applied to our infrastructure in the hours preceding the incident.

Impact

During the incident, approximately 30% of all API and Dashboard calls resulted in errors. These errors were spread across all endpoints and customers, depending on their traffic distribution and type.

Detection

Our real-time monitoring system first detected an error around 5:45 AM UTC and quickly sent out alerts via our internal communication channels. System Administrators started investigating the issue shortly thereafter. By 6:30 AM UTC, the Incident Response Team was actively addressing the problem.

Root Cause

We have finished investigating our database infrastructure and codebase. We identified specific timeout parameters that, under certain concurrency conditions, could lead to the issue we experienced.

When this issue arises, application transactions may fail to obtain a database connection from the connection pool, resulting in an error. At the same time, the current auto-scaling parameters were unable to detect this specific temporary glitch, so other transactions received a timeout before the database could complete the requested operation.
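To illustrate the failure mode described above, here is a minimal, self-contained sketch of a connection pool with a checkout timeout. It is a toy model built for this report: the class, connection stand-ins, and timeout values are all hypothetical, and do not reflect our actual stack or configuration. When the pool is exhausted, a transaction waits for the configured timeout and then errors instead of reaching the database.

```python
import queue


class ConnectionPool:
    """Toy connection pool (hypothetical; for illustration only)."""

    def __init__(self, size, checkout_timeout):
        self.checkout_timeout = checkout_timeout
        self._pool = queue.Queue(maxsize=size)
        for i in range(size):
            self._pool.put(f"conn-{i}")  # stand-ins for real DB connections

    def checkout(self):
        # Block until a connection frees up, or fail after checkout_timeout.
        try:
            return self._pool.get(timeout=self.checkout_timeout)
        except queue.Empty:
            raise TimeoutError("could not acquire a connection from the pool")

    def checkin(self, conn):
        self._pool.put(conn)


pool = ConnectionPool(size=2, checkout_timeout=0.1)
held = [pool.checkout(), pool.checkout()]  # pool is now exhausted
try:
    pool.checkout()  # a third transaction times out instead of connecting
except TimeoutError as e:
    print("transaction failed:", e)
```

Under concurrent load, every transaction that hits this timeout surfaces to the caller as an API error, which matches the error pattern observed during the incident.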

Mitigation and Resolution

From 6:00 AM to 6:45 AM UTC, the Team examined the logs and monitoring systems to identify specific errors and possible immediate emergency actions. Recognizing the issue's impact, they opted for an aggressive approach, which led to a restart of the runtime environment. By around 7:00 AM UTC, all API and Dashboard calls were error-free. The Team continued to work on identifying the root cause to resolve the issue definitively, while continuing to monitor the situation, aware that side effects could still occur.

At 9:10 AM UTC, they indeed observed degraded database performance. In this case, they applied a few mitigation measures, such as horizontal scaling and a code hotfix that deployed a specific optimization.

Finally, a more suitable set of timeout parameters was applied to the database infrastructure, allowing the team to consider the issue resolved.

Corrective and Preventative Measures

To significantly lessen, or even eliminate, the likelihood of this issue recurring, we've already implemented extra controls in our codebase and set up alerts for proactive detection and automated resolution of the specific condition. We are also developing additional auto-scaling criteria to add capacity should similar conditions recur in the future, even when triggered for different reasons.
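The kind of auto-scaling criterion described above can be sketched as a simple predicate over observed metrics. The metric names and thresholds below are illustrative assumptions made for this example, not our actual configuration: the point is that scaling triggers on the symptoms (connection waits, error rates) rather than on one specific cause.

```python
# Hypothetical auto-scaling criterion; metric names and thresholds
# are illustrative assumptions, not an actual production configuration.
def should_scale_out(pool_wait_p95_ms, error_rate,
                     wait_threshold_ms=500, error_threshold=0.05):
    """Add capacity when connection-checkout waits or error rates spike,
    regardless of which underlying condition triggered them."""
    return pool_wait_p95_ms > wait_threshold_ms or error_rate > error_threshold


print(should_scale_out(pool_wait_p95_ms=800, error_rate=0.01))  # True: slow checkouts alone
print(should_scale_out(pool_wait_p95_ms=120, error_rate=0.10))  # True: error spike alone
print(should_scale_out(pool_wait_p95_ms=120, error_rate=0.01))  # False: healthy
```

Keying on symptoms means a future glitch with a different root cause would still add capacity, which is the intent of the new criteria.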

Posted Mar 28, 2024 - 18:28 CET

The issue is resolved and the incident is closed. We're working on the root cause analysis; a post-mortem report will be available soon.
Posted Mar 26, 2024 - 15:50 CET
A fix has been applied. We'll keep monitoring the situation over the next few hours.
Posted Mar 26, 2024 - 12:32 CET
We are experiencing a performance degradation of our API and Dashboard services. We are currently applying mitigation measures while we keep investigating the root cause and will provide updates here.
Posted Mar 26, 2024 - 10:41 CET
We've experienced a partial outage on our API services and Dashboard from around 6:45 AM to 8:00 AM CET. Our engineers have restored the service and are currently investigating the root cause.
Posted Mar 26, 2024 - 09:35 CET
This incident affected: Commerce API and Dashboard.