All systems are operational

About This Site

For other news, please have a look at our blog at greenhost.net/blog.

Past Incidents

Monday 23rd December 2024

No incidents reported

Sunday 22nd December 2024

No incidents reported

Saturday 21st December 2024

No incidents reported

Friday 20th December 2024

No incidents reported

Thursday 19th December 2024

No incidents reported

Wednesday 18th December 2024

No incidents reported

Tuesday 17th December 2024

No incidents reported

Monday 16th December 2024

No incidents reported

Sunday 15th December 2024

No incidents reported

Saturday 14th December 2024

No incidents reported

Friday 13th December 2024

No incidents reported

Thursday 12th December 2024

VPS Hardware failure in Virtualization Server 12-12

RFO (Reason for Outage)

Summary

On the 12th of December, a virtualization server experienced a hardware failure. As a result, a subset of customer VPSs became unreachable or unavailable.

Timeline

Times are in CET:

  • 12:33: The virtualization machine reported a failure.
  • 12:59: Monitoring alerted engineers.
  • 13:39: VPSs were restarted, and customers were informed.
  • 14:12: Engineers were made aware that the restart was not successful for the majority of VPSs due to a locking issue preventing I/O. Locks were manually removed, and the systems were restarted again.
  • 14:45: VPSs were successfully restarted.
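
To make the intervals in the timeline above easier to read, the following is a minimal Python sketch that computes the time between each step. The times are copied from the timeline; the names `TIMELINE`, `minutes_between`, and `step_durations` are hypothetical and chosen only for illustration, and are not part of any Greenhost tooling.

```python
from datetime import datetime

# Timeline entries copied from the report (times in CET, 12 December 2024).
# The short labels are paraphrases added for illustration.
TIMELINE = [
    ("12:33", "virtualization machine reported a failure"),
    ("12:59", "monitoring alerted engineers"),
    ("13:39", "VPSs restarted, customers informed"),
    ("14:12", "failed restart identified, locks removed, second restart"),
    ("14:45", "VPSs successfully restarted"),
]


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def step_durations(timeline):
    """Return (from_label, to_label, minutes) for each consecutive pair of entries."""
    return [
        (prev_label, label, minutes_between(prev_time, time))
        for (prev_time, prev_label), (time, label) in zip(timeline, timeline[1:])
    ]


# Time from the hardware failure to the alert reaching engineers.
detection_delay = minutes_between(TIMELINE[0][0], TIMELINE[1][0])

# Time from the hardware failure to the successful restart.
total_outage = minutes_between(TIMELINE[0][0], TIMELINE[-1][0])

if __name__ == "__main__":
    for start_label, end_label, minutes in step_durations(TIMELINE):
        print(f"{minutes:3d} min: {start_label} -> {end_label}")
    print(f"Detection delay: {detection_delay} min")
    print(f"Total time to recovery: {total_outage} min")
```

With the times listed above, this gives 26 minutes from the failure to the alert and 132 minutes from the failure to the successful restart.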

Root Cause

VPSs run on virtualization machines (physical hardware). One of these machines experienced a failure, rendering all VPSs hosted on it unavailable.

The standard procedure is to start VPSs on other systems to recover. However, due to a new feature of the storage cluster software (CEPH), a lock on the virtual disks was set, preventing full I/O capabilities on restart. This unexpected behavior caused the standard procedure to fail and required additional corrective action, delaying the recovery.

Affected services

  • Subset of customer VPSs

Corrective/Preventative Actions

Although hardware failures are inevitable, the time taken to recover exceeded expectations. To reduce recovery time, the following actions will be implemented:

  • The time between the hardware failure and notification to our engineers exceeded our normal notification period. We will investigate and address the delays in notification.
  • The storage locking mechanism will be reviewed. Either the mechanism will be corrected, or the SOP will be adjusted to account for this behavior.

Wednesday 11th December 2024

No incidents reported

Tuesday 10th December 2024

No incidents reported

Monday 9th December 2024

No incidents reported