All systems are operational

Past Incidents

13th December 2024

No incidents reported

12th December 2024

VPS Hardware failure in Virtualization Server 12-12

RFO (Reason for Outage)

Summary

On the 12th of December 2024, a virtualization server experienced a hardware failure that left a subset of customer VPSs unreachable or unavailable.

Timeline

Times are in CET:

  • 12:33: The virtualization machine reported a failure.
  • 12:59: Monitoring alerted engineers.
  • 13:39: VPSs were restarted, and customers were informed.
  • 14:12: Engineers became aware that the restart had failed for the majority of VPSs because a locking issue was preventing I/O. The locks were removed manually, and the systems were restarted again.
  • 14:45: VPSs were successfully restarted.
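
For readers who want the duration of each phase, the timeline above can be turned into intervals with a few lines of Python. This is a minimal sketch: the timestamps are copied from the list above, while the event labels and function names are illustrative and not part of any internal tooling.

```python
from datetime import datetime

# Timestamps copied from the timeline above (CET, 12th December 2024).
# The short labels are illustrative summaries of each entry.
EVENTS = [
    ("failure reported", "12:33"),
    ("engineers alerted", "12:59"),
    ("first restart", "13:39"),
    ("locks removed, second restart", "14:12"),
    ("VPSs restored", "14:45"),
]


def minutes_between(start, end):
    """Return whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def phase_durations(events):
    """Pair each event with the next one and compute the gap in minutes."""
    return [
        (f"{a[0]} -> {b[0]}", minutes_between(a[1], b[1]))
        for a, b in zip(events, events[1:])
    ]


durations = phase_durations(EVENTS)
total = minutes_between(EVENTS[0][1], EVENTS[-1][1])

for label, mins in durations:
    print(f"{label}: {mins} min")
print(f"total: {total} min")
```

With the times listed above, the gap between the failure and the alert is 26 minutes, and the full incident spans 132 minutes.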

Root Cause

VPSs run on virtualization machines (physical hardware). One of these machines experienced a failure, rendering all VPSs hosted on it unavailable.

The standard recovery procedure is to start the affected VPSs on other virtualization machines. However, a new feature of the storage cluster software (Ceph) had set a lock on the virtual disks, which prevented full I/O when the VPSs were restarted. This unexpected behavior caused the standard procedure to fail and required additional corrective action, which delayed the recovery.
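
As an illustration of what removing such locks by hand can involve, the sketch below builds the commands without running them. It rests on assumptions this report does not confirm: that the virtual disks are Ceph RBD images and that the locks are the advisory locks shown by `rbd lock ls` and cleared with `rbd lock rm`. The image name, lock ID, and locker in the example are hypothetical.

```python
def build_lock_removal_commands(image_spec, locks):
    """Return one `rbd lock rm` argument list per lock entry.

    `locks` is a list of dicts with "id" and "locker" keys, such as an
    operator might transcribe from `rbd lock ls <image_spec>`.
    Nothing is executed here; the caller decides whether to run them.
    """
    return [
        ["rbd", "lock", "rm", image_spec, lock["id"], lock["locker"]]
        for lock in locks
    ]


# Hypothetical example values, not taken from the incident.
example_locks = [{"id": "auto 1234", "locker": "client.4567"}]
commands = build_lock_removal_commands("vps-pool/vm-001-disk-0", example_locks)

for argv in commands:
    print(" ".join(repr(part) if " " in part else part for part in argv))
```

Building the argument lists separately from running them lets an engineer review exactly which locks would be removed before anything touches the cluster.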

Affected services

  • A subset of customer VPSs

Corrective/Preventative Actions

Although hardware failures are inevitable, the time taken to recover exceeded expectations. To reduce recovery time, the following actions will be implemented:

  • The time between the hardware failure and notification to our engineers exceeded our normal notification period. We will investigate and address the delays in notification.
  • The storage locking mechanism will be reviewed. Either the locking behavior will be corrected, or the SOP (standard operating procedure) will be adjusted to account for it.

11th December 2024

No incidents reported

10th December 2024

No incidents reported

9th December 2024

No incidents reported

8th December 2024

No incidents reported

7th December 2024

No incidents reported