Greenhost Status

VPS Hardware failure in Virtualization Server 12-12

RFO (Reason for Outage)

Summary

On 12 December, a virtualization server experienced a hardware failure. As a result, a subset of customer VPSs became unreachable or unavailable.

Timeline

Times are in CET:

12:33: The virtualization machine reported a failure.
12:59: Monitoring alerted engineers.
13:39: VPSs were restarted, and customers were informed.
14:12: Engineers were made aware that the restart was not successful for the majority of VPSs due to a locking issue preventing I/O. Locks were manually removed, and the systems were restarted again.
14:45: VPSs were successfully restarted.
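
The intervals implied by this timeline can be worked out with a short script. This is only an illustrative sketch: the timestamps are copied from the entries above, while the event labels and the helper names (`minutes_between`, `elapsed_since_failure`) are hypothetical and not part of any Greenhost tooling.

```python
from datetime import datetime

# Timestamps copied from the timeline above (CET, same day).
# The short labels are paraphrases added for illustration.
EVENTS = [
    ("12:33", "hardware failure reported"),
    ("12:59", "monitoring alerted engineers"),
    ("13:39", "first restart, customers informed"),
    ("14:12", "failed restart identified, locks removed"),
    ("14:45", "VPSs successfully restarted"),
]


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def elapsed_since_failure(events):
    """Attach to each event the minutes elapsed since the first event."""
    first_time = events[0][0]
    return [(t, label, minutes_between(first_time, t)) for t, label in events]


if __name__ == "__main__":
    for t, label, elapsed in elapsed_since_failure(EVENTS):
        print(f"{t}  +{elapsed:3d} min  {label}")
```

With these timestamps, engineers were alerted 26 minutes after the failure, and full recovery came 132 minutes after it.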

Root Cause

VPSs run on virtualization machines (physical hardware). One of these machines experienced a failure, rendering all VPSs hosted on it unavailable.

The standard procedure is to start the affected VPSs on other systems to recover. However, due to a new feature of the storage cluster software (Ceph), a lock had been set on the virtual disks, which prevented full I/O on restart. This unexpected behavior caused the standard procedure to fail and required additional corrective action, delaying the recovery.

Affected services

A subset of customer VPSs

Corrective/Preventative Actions

Although hardware failures are inevitable, the time taken to recover exceeded expectations. To reduce recovery time, the following actions will be implemented:

The time between the hardware failure and notification to our engineers exceeded our normal notification period. We will investigate and address the delays in notification.
The storage locking mechanism will be reviewed. Either the behavior will be corrected, or the standard operating procedure (SOP) will be adjusted to account for it.

Past Incidents

Thursday 12th December 2024