RFO (Reason for Outage)
Summary
On the 12th of December, a virtualization server experienced a hardware failure, leaving a portion of customer VPSs unreachable or unavailable.
Timeline
Times are in CET:
- 12:33: The virtualization machine reported a failure.
- 12:59: Monitoring alerted engineers.
- 13:39: A restart of the VPSs was initiated, and customers were informed.
- 14:12: Engineers became aware that the restart had failed for the majority of VPSs because of a locking issue that prevented I/O. The locks were removed manually, and the systems were restarted again.
- 14:45: VPSs were successfully restarted.
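The intervals implied by the timeline can be worked out with a short script. This is only an illustrative sketch: the timestamps are copied from the list above, while the event labels and function names are shorthand introduced here, not part of any monitoring tooling.

```python
from datetime import datetime

# Timestamps (CET) copied from the timeline; labels are shorthand for this sketch.
EVENTS = [
    ("failure reported", "12:33"),
    ("engineers alerted", "12:59"),
    ("first restart", "13:39"),
    ("locks removed, second restart", "14:12"),
    ("VPSs restored", "14:45"),
]


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end, both HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def elapsed_since_failure(events):
    """Pair each event label with the minutes elapsed since the first event."""
    first_time = events[0][1]
    return [(label, minutes_between(first_time, time)) for label, time in events]


for label, minutes in elapsed_since_failure(EVENTS):
    print(f"{minutes:4d} min  {label}")
```

By this calculation, 26 minutes passed between the failure and the alert, and 132 minutes between the failure and full recovery.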
Root Cause
VPSs run on virtualization machines (physical hardware). One of these machines experienced a failure, rendering all VPSs hosted on it unavailable.
The standard procedure is to start the VPSs on other systems to recover. However, due to a new feature of the storage cluster software (Ceph), a lock had been set on the virtual disks, preventing full I/O on restart. This unexpected behavior caused the standard procedure to fail and required additional corrective action, delaying the recovery.
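As a rough illustration of what manual lock removal can look like, the sketch below builds the Ceph `rbd lock ls` and `rbd lock rm` commands for a single disk image. It assumes the locks were RBD image locks visible to those commands, which this report does not confirm. The image name, lock ID, and locker shown are hypothetical placeholders. The helper only prints the commands unless `dry_run` is disabled.

```python
import subprocess


def lock_list_cmd(image_spec):
    """Command that lists the locks held on an RBD image (pool/image)."""
    return ["rbd", "lock", "ls", image_spec]


def lock_remove_cmd(image_spec, lock_id, locker):
    """Command that removes one lock; lock_id and locker come from 'rbd lock ls'."""
    return ["rbd", "lock", "rm", image_spec, lock_id, locker]


def run(cmd, dry_run=True):
    """Print the command in dry-run mode; otherwise execute it and return stdout."""
    if dry_run:
        print("would run:", " ".join(cmd))
        return None
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Hypothetical values, for illustration only.
image = "vps-pool/vm-example-disk-0"
run(lock_list_cmd(image))
run(lock_remove_cmd(image, "example-lock-id", "client.0000"))
```

Keeping the default as a dry run is a deliberate choice: removing a lock that a live client still holds can lead to two writers on the same disk, so each lock should be checked against the failed host before removal.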
Affected Services
Customer VPSs hosted on the failed virtualization machine.
Corrective/Preventative Actions
Although hardware failures are inevitable, the time taken to recover exceeded expectations. To reduce recovery time, the following actions will be implemented:
- The time between the hardware failure and notification to our engineers exceeded our normal notification period. We will investigate and address the delays in notification.
- The storage locking mechanism will be reviewed. Either its behavior will be corrected, or the standard operating procedure (SOP) will be adjusted to account for it.