On April 9th, 2024, from 6:45 AM to 8:55 AM UTC, during a scheduled maintenance window, we encountered a database issue. The database version upgrade process failed, and the automatic recovery provided by our cloud infrastructure provider did not restore the database as expected. We immediately contacted the infrastructure provider’s support team, who resolved the problem; no data was lost.
We began the database upgrade maintenance on April 9th, 2024, at 6:39 AM UTC. At approximately 6:45 AM UTC, the background upgrade task run by the cloud database service failed, leaving the entire upgrade procedure stuck in an intermediate state.
During the incident, all Core API calls resulted in errors. The errors affected all endpoints and customers, with the impact on each customer depending on their traffic volume and request types. Read requests served from our CDN cache continued to work without errors.
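For illustration, the split between origin traffic and CDN-cached traffic can be observed with a simple probe like the sketch below; the endpoint URLs are placeholders, not our actual Core API or CDN hostnames.

```python
import urllib.request
import urllib.error

# Placeholder URLs, not our real endpoints.
ENDPOINTS = {
    "core_api": "https://api.example.com/v1/items",        # served by the origin (hits the database)
    "cdn_cached_read": "https://cdn.example.com/v1/catalog",  # served from the CDN cache
}

def probe(name, url):
    """Issue a GET and report the HTTP status, so the two paths can be compared."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name}: HTTP {resp.status}")
    except urllib.error.HTTPError as exc:
        print(f"{name}: HTTP {exc.code}")
    except urllib.error.URLError as exc:
        print(f"{name}: unreachable ({exc.reason})")

for name, url in ENDPOINTS.items():
    probe(name, url)
```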
A few minutes after beginning the maintenance, our engineers realized that the automatic upgrade process was not responding. The cloud service provides a safeguard for upgrades that automatically restores normal operations if the procedure does not complete within a 5-minute timeout. From our side, this expected recovery did not take effect.
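For context, the sketch below illustrates the kind of timeout-based safeguard described above. It is a minimal approximation, assuming hypothetical status and rollback hooks; the real mechanism is internal to the cloud provider and not exposed to us.

```python
import time

# The provider documents a 5-minute threshold for upgrade recovery.
UPGRADE_TIMEOUT_SECONDS = 5 * 60
POLL_INTERVAL_SECONDS = 15

def watchdog(get_status, rollback, timeout=UPGRADE_TIMEOUT_SECONDS,
             poll_interval=POLL_INTERVAL_SECONDS):
    """Poll the upgrade status and trigger a rollback if the upgrade fails
    or does not complete within the timeout.

    `get_status` and `rollback` stand in for the provider's internal
    mechanisms, which we do not have access to.
    """
    started = time.monotonic()
    while True:
        status = get_status()
        if status == "completed":
            return "upgrade finished normally"
        if status == "failed" or time.monotonic() - started > timeout:
            rollback()
            return "rollback triggered"
        time.sleep(poll_interval)

if __name__ == "__main__":
    # Simulated run: the upgrade reports "running" forever, so the watchdog
    # rolls back once the (shortened) timeout elapses.
    result = watchdog(get_status=lambda: "running",
                      rollback=lambda: print("restoring previous version"),
                      timeout=2, poll_interval=0.5)
    print(result)
```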
The service provider's support team confirmed that the upgrade process, once initiated, encountered an unexpected issue within their internal procedure, causing the upgrade to stall.
According to the provider, the automatic recovery safeguard did begin five minutes after the failure, but it could not complete because one of the read replicas was in an inconsistent state. The vendor is still investigating the root cause of this issue.
Since the database service is fully managed by the provider, our engineers had to wait for the provider's support team to restore normal availability.
A few minutes before 7:00 AM UTC, our engineers tried to interrupt the process manually, but the cancel option had been disabled by the vendor. Our team immediately engaged the service provider’s support team, requesting with the highest urgency that the process be interrupted and database availability restored.
The provider’s support team began investigating with their internal tools and, after a few minutes, suggested a couple of workarounds, neither of which succeeded.
In parallel, we began our own restoration process from a backup.
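For context, a simplified sketch of what such a parallel restoration path can look like, assuming a PostgreSQL-compatible engine and a locally available logical dump; the actual engine, backup format, and target instance used during the incident are provider-specific and not shown here.

```python
import subprocess

# Placeholder values; the real host, database, and dump path are provider-specific.
BACKUP_FILE = "/backups/core_db_2024-04-09.dump"
TARGET_HOST = "standby-db.internal.example.com"
TARGET_DB = "core"

def restore_from_backup():
    """Restore the logical dump into a standby instance using pg_restore."""
    subprocess.run(
        [
            "pg_restore",
            "--host", TARGET_HOST,
            "--dbname", TARGET_DB,
            "--clean",      # drop existing objects before recreating them
            "--no-owner",   # do not try to match original object ownership
            "--jobs", "4",  # restore with four parallel workers
            BACKUP_FILE,
        ],
        check=True,
    )

if __name__ == "__main__":
    restore_from_backup()
```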
At about 8:50 AM UTC, the service provider’s support team restored database availability, and our services gradually recovered to normal operation. By approximately 8:55 AM UTC, the resolution was complete.
Given the nature of the cloud infrastructure on which our services are built, not all operational steps are fully within our control. However, we have identified improvements to our processes and procedures that we intend to implement together with our service provider: