On the 12th of December, a virtualization server experienced a hardware failure, affecting customer VPSs partially and rendering them unreachable or unavailable.
Times are in CET:
VPSs run on virtualization machines (physical hardware). One of these machines experienced a failure, rendering all VPSs hosted on it unavailable.
The standard procedure is to start VPSs on other systems to recover. However, due to a new feature of the storage cluster software (CEPH), a lock on the virtual disks was set, preventing full I/O capabilities on restart. This unexpected behavior caused the standard procedure to fail and required additional corrective action, delaying the recovery
Although hardware failures are inevitable, the time taken to recover exceeded expectations. To reduce recovery time, the following actions will be implemented: