Hardware failure in Virtualization Server 12-12

RFO (Reason for Outage)

Summary

On the 12th of December, a virtualization server experienced a hardware failure, leaving a subset of customer VPSs unreachable or unavailable.

Timeline

Times are in CET:

12:33: The virtualization machine reported a failure.
12:59: Monitoring alerted engineers.
13:39: VPSs were restarted, and customers were informed.
14:12: Engineers were made aware that the restart was not successful for the majority of VPSs due to a locking issue preventing I/O. Locks were manually removed, and the systems were restarted again.
14:45: VPSs were successfully restarted.
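
For reference, the phase durations implied by the timeline can be worked out with a short script. This is only an illustrative sketch: the timestamps are copied from the timeline above, while the event labels and the helper names (minutes_between, phase_durations) are made up for this example and are not part of any tooling used during the incident.

```python
from datetime import datetime

# Timestamps copied from the timeline (CET, all on the same day).
# The short labels are illustrative paraphrases, not official event names.
events = [
    ("12:33", "failure reported"),
    ("12:59", "engineers alerted"),
    ("13:39", "first restart"),
    ("14:12", "locks removed, second restart"),
    ("14:45", "VPSs restored"),
]


def minutes_between(start, end):
    """Return whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def phase_durations(timeline):
    """Return (label, minutes) for each pair of consecutive events."""
    return [
        (f"{a[1]} -> {b[1]}", minutes_between(a[0], b[0]))
        for a, b in zip(timeline, timeline[1:])
    ]


total = minutes_between(events[0][0], events[-1][0])

for label, mins in phase_durations(events):
    print(f"{label}: {mins} min")
print(f"total: {total} min")
```

By this calculation, 26 minutes passed between the failure and the alert, and the total time from failure to successful restart was 132 minutes (2 hours 12 minutes), of which 66 minutes fell after the first restart attempt.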

Root Cause

VPSs run on virtualization machines (physical hardware). One of these machines experienced a failure, rendering all VPSs hosted on it unavailable.

The standard procedure is to start VPSs on other systems to recover. However, due to a new feature of the storage cluster software (Ceph), a lock was set on the virtual disks, preventing full I/O on restart. This unexpected behavior caused the standard procedure to fail and required additional corrective action, which delayed the recovery.

Affected services

A subset of customer VPSs

Corrective/Preventative Actions

Although hardware failures are inevitable, the time taken to recover exceeded expectations. To reduce recovery time, the following actions will be implemented: