On November 7, 2023, between 4:33 PM and 5:00 PM UTC, a portion of API and Dashboard calls returned HTTP 503 errors. The issue was caused by an internal network condition. Our monitoring system detected the event and promptly alerted us through our internal communication channels. To resolve the issue, we restarted all backend application servers.
On November 7, 2023, at approximately 4:33 PM UTC, an unusual condition arose in our internal network, despite no changes having been made to our infrastructure in the hours leading up to the incident.
During the incident, around 16% of all API and Dashboard calls resulted in errors. The errors were spread across all endpoints and customers, in proportion to the volume and type of each customer's traffic.
Our real-time monitoring system detected the first error at approximately 4:33 PM UTC and promptly sent alerts to our internal communication channels. Within a minute, our system administrators began investigating, and by 4:40 PM UTC our Incident Response Team was actively working to mitigate the problem.
We have completed our investigation of the cloud infrastructure and identified a mismatch between specific timeout settings on our internet-facing load balancers and on the internal network proxies. This mismatch produced the condition we experienced: the internal proxies closed a number of connections that the public load balancers still considered open, so the load balancers continued to route incoming requests over connections that had already been closed.
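The failure window can be sketched in a few lines. This is an illustration only: the timeout values, and the assumption that the mismatch was specifically between idle-connection timeouts, are hypothetical, since the report does not disclose the real settings.

```python
# Hypothetical timeout values (seconds) for illustration only;
# the actual settings are internal.
PROXY_IDLE_TIMEOUT = 55  # internal proxy closes idle keep-alive connections
LB_IDLE_TIMEOUT = 60     # public load balancer still considers them reusable

def request_outcome(idle_seconds: float) -> str:
    """Outcome of routing a request over a pooled connection that has
    been idle for `idle_seconds`."""
    if idle_seconds >= LB_IDLE_TIMEOUT:
        return "new connection"  # LB opens a fresh connection: request succeeds
    if idle_seconds >= PROXY_IDLE_TIMEOUT:
        return "503"             # proxy already closed the connection: request fails
    return "200"                 # connection still alive on both sides

# Any idle time in the [PROXY_IDLE_TIMEOUT, LB_IDLE_TIMEOUT) window fails.
assert request_outcome(30) == "200"
assert request_outcome(57) == "503"
assert request_outcome(61) == "new connection"
```

The point of the sketch is that errors arise only in the window where one side has given up on a connection while the other still trusts it.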
Between 4:40 PM and 4:55 PM UTC, the team took several mitigating actions along the network path, which partially reduced the error rate. Realizing that a more aggressive approach was needed, they restarted the runtime environment. By 5:00 PM UTC, API and Dashboard calls were no longer encountering errors, and the team considered the incident resolved.
To minimize, if not entirely eliminate, the likelihood of this issue recurring under similar circumstances, we have adjusted the timeout settings identified during the root-cause analysis.
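The invariant behind such an adjustment can be expressed as a simple check. The rule shown here (the proxy behind a load balancer should keep idle connections open longer than the load balancer does) is a common convention for this class of problem, not a disclosed detail of our configuration, and the values are hypothetical.

```python
# Hypothetical timeout values (seconds); the actual settings are internal.
TIMEOUTS = {
    "public_load_balancer_idle": 60,
    "internal_proxy_keepalive": 75,  # raised above the LB idle timeout
}

def timeouts_consistent(lb_idle: int, proxy_keepalive: int) -> bool:
    """A proxy behind a load balancer should keep idle connections open
    longer than the load balancer, so the LB never reuses a connection
    the proxy has already closed."""
    return proxy_keepalive > lb_idle

assert timeouts_consistent(
    TIMEOUTS["public_load_balancer_idle"],
    TIMEOUTS["internal_proxy_keepalive"],
)
```

A check like this can run in configuration tests or CI, so that a future change to either tier's timeout is caught before deployment rather than in production.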