On the morning of Wednesday, May 14, we began receiving reports from Clearly Cloud users in parts of the US of intermittent issues with registrations and call completion. The engineering team began investigating immediately and closely monitored the performance of all related systems over the next several hours, identifying the cause of the problem.
A full analysis confirmed that call signaling messages were delayed during three timeframes, lasting approximately 13 minutes, 6 minutes, and 5 minutes respectively. The cause of the problem was a replica (spare) database server that was unable to keep up with synchronization from the primary systems, creating widespread delays and backlogs in tasks such as registration handling.
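For readers interested in the general pattern, the sketch below shows threshold-based monitoring of replication lag, the kind of signal that reveals a replica falling behind its primary. It is illustrative only: the fetch_replica_lag_seconds helper, threshold, and polling cadence are assumptions, not ClearlyIP's actual tooling.

```python
# Minimal sketch of threshold-based replication-lag monitoring.
# The lag source, threshold, and alerting are hypothetical placeholders;
# a real deployment would read lag from the database's replication status
# and page the on-call engineer instead of printing.
import random
import time

LAG_THRESHOLD_SECONDS = 30   # hypothetical alerting threshold
POLL_INTERVAL_SECONDS = 5    # hypothetical polling cadence


def fetch_replica_lag_seconds() -> float:
    """Stand-in for querying the replica's reported lag behind the primary."""
    return random.uniform(0, 60)  # simulated lag for illustration only


def check_replica_once() -> None:
    lag = fetch_replica_lag_seconds()
    if lag > LAG_THRESHOLD_SECONDS:
        # A lagging replica builds a backlog of unapplied changes; alerting
        # early allows it to be pulled from the topology before it affects
        # tasks that depend on timely synchronization.
        print(f"ALERT: replica lag {lag:.0f}s exceeds {LAG_THRESHOLD_SECONDS}s")
    else:
        print(f"OK: replica lag {lag:.0f}s")


if __name__ == "__main__":
    for _ in range(3):          # bounded demo loop
        check_replica_once()
        time.sleep(POLL_INTERVAL_SECONDS)
```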
Once the cause was identified as a non-production server, our team took immediate action, disabling its replica role and disconnecting it from the production systems while evaluating the underlying performance problem. This prevented any issues beyond those already experienced. Metrics and logs indicated a likely hardware fault. Thankfully, ClearlyIP's recent investments in improving its Central US datacenter operations meant our team could quickly move the replica database to this newer environment.
The migration was completed several hours after the issues began, placing the replica server in the new environment, and it was restored to service late that afternoon. No similar Clearly Cloud issues were observed after the replica server was taken out of service, nor since it was restored. Our teams will continue to proactively monitor the performance of these systems and evaluate additional improvements to minimize the impact of similar circumstances in the future.