“Is this the one?”
“Yeah, that’s the one.”
Yank!

As the last of the servers died with a pathetic beep, I think I heard the poor electrician, still holding the unlabeled end of the rack’s power line, invoke the name of Jesus.

(Obviously dramatized, but mostly accurate.)

The virtualization servers came back online with some fuss, but they at least look functional. My tasks for the day are set: wade through about 400 error messages, verify the functionality and integrity of 117 virtual machines, restore backups as needed, and verify the SMART status on every physical hard drive.

(edit 1) High Availability tried to migrate all of one host’s VMs to the other, but it isn’t worth much if both HA hosts are on the same circuit and die within seconds of each other. Now all but a few VMs are running on a single host.

(edit 2) Some of the PhDs are angry because their long-running ML projects got interrupted. They didn’t set up checkpoints or live backups, so that’s entirely their fault.
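(For anyone wondering what "set up checkpoints" means in practice, here is a minimal sketch, not what anyone here actually runs. The file name, the `run` loop, and the dummy "work" inside it are all made up for illustration; a real ML job would save model and optimizer state through its framework instead of a JSON dict. The idea is just to write progress to disk every so often, and to write it via a temp file plus rename so an interruption mid-write leaves the previous checkpoint in place.)

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp, path)  # swap in the new file in a single step
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def load_checkpoint(path, default):
    """Return the saved state, or the default if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, every=100):
    """Hypothetical long-running job that resumes from its last checkpoint."""
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    for step in range(state["step"], total_steps):
        state["acc"] += step  # stand-in for the real work
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

(With something like this, a job killed at step 180 restarts from step 100 instead of from zero. How often to checkpoint is a trade-off between I/O cost and how much work you can afford to lose.)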

(edit 3) Five hours later, only one VM needed manual intervention (apart from migrating the VMs back to their original hosts), and all the hard drives are in good condition for their age. This turned out to be a really boring disaster.

  • 1995ToyotaCorolla@lemmy.world
    10 days ago

    You just triggered my PTSD. We had a leg of our three-phase burn out in juuuust the right way where our UPS wouldn’t accept the dirty power but it didn’t trigger the backup generator. The electrician didn’t arrive in time to manually engage the transfer switch, and the batteries ran down on the UPS before we could get everything safely shut down. I don’t think I’ve encountered a quiet quite like that since.