Our response began immediately when availability monitors detected a possible issue with Clearly Cloud operations. Once impact to some Clearly Cloud services was confirmed, the NOC team worked to restore full functionality as quickly as possible. As service was restored, teams began reconstructing the sequence of events that caused the disruption and identifying its root causes.
Ultimately, it was determined that a core network switch malfunctioned for a very brief period, automatically triggering the protective high-availability systems. A number of resources shifted to route around the perceived fault, but the nature of the actual equipment failure meant manual intervention was required to complete recovery. ClearlyIP has initiated a proactive replacement of all switches of the same type, revised its incident response protocols, and improved processes that will both reduce the likelihood of recurrence and significantly shorten recovery times in other scenarios.