Hardware failure in Virtualization Server 12-12 Thursday 12th December 2024 12:33:00


RFO (Reason for Outage)

Summary

On the 12th of December 2024, a virtualization server experienced a hardware failure. Some customer VPSs were affected and became unreachable or unavailable.

Timeline

Times are in CET:

Root Cause

VPSs run on virtualization machines (physical hardware). One of these machines experienced a failure, rendering all VPSs hosted on it unavailable.

The standard recovery procedure is to start the affected VPSs on other systems. However, a new feature of the storage cluster software (Ceph) had set a lock on the virtual disks, which prevented full I/O when the VPSs were restarted. This unexpected behavior caused the standard procedure to fail and required additional corrective action, which delayed the recovery.
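To illustrate the failure mode described above, here is a minimal sketch of a pre-restart check that separates VPSs whose disks are free from those still locked by the failed host. It is not the tooling used during this incident: the names `DiskLock`, `plan_recovery`, the host and disk identifiers, and the assumption that lock records are available as simple (disk, holder) pairs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskLock:
    """Hypothetical lock record as it might be reported by the storage cluster."""
    disk: str    # virtual disk identifier (illustrative naming)
    holder: str  # host recorded as the current lock owner


def plan_recovery(vps_disks, locks, failed_host):
    """Split VPSs into those that can be restarted and those blocked by a lock.

    vps_disks:   dict mapping VPS name -> virtual disk identifier
    locks:       iterable of DiskLock records
    failed_host: name of the host that failed
    Returns (ready, blocked), both sorted lists of VPS names.
    """
    # Disks whose lock is still held by the failed host cannot get full I/O.
    stale = {lock.disk for lock in locks if lock.holder == failed_host}
    ready, blocked = [], []
    for vps, disk in sorted(vps_disks.items()):
        (blocked if disk in stale else ready).append(vps)
    return ready, blocked


if __name__ == "__main__":
    locks = [DiskLock("disk-a", "host-1"), DiskLock("disk-c", "host-2")]
    vps_disks = {"vps-a": "disk-a", "vps-b": "disk-b"}
    ready, blocked = plan_recovery(vps_disks, locks, "host-1")
    print("restart now:", ready)
    print("release lock first:", blocked)
```

In this sketch, VPSs in the blocked list would need their disk lock released before a restart on another system could succeed. A check of this kind surfaces the lock up front rather than after a failed restart.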

Affected services

Corrective/Preventative Actions

Although hardware failures are inevitable, the time taken to recover exceeded expectations. To reduce recovery time, the following actions will be implemented: