how I wound up causing a major outage of my services and destroying my home directory by accident
As a result of my FOSS maintenance and activism work, I have a significant IT footprint to support the services and development environments needed to facilitate everything I do. Unfortunately, I am also my own system administrator, and I am quite terrible at this. This is a story about how I wound up knocking most of my services offline and wiping out my home directory, because of a combination of Linux mdraid bugs and a faulty SSD. Hopefully this will be helpful to somebody in the future, but if not, you can at least share in some catharsis.
A brief overview of the setup
As noted, I have a cluster of multiple servers, ranging from AMD EPYC machines to ARM machines to a significant interest in a System z mainframe which I talked about at AlpineConf last year. These are used to host various services in virtual machine and container form, with the majority of the containers being managed by kubernetes in the current iteration of my setup. Most of these workloads are backed by an Isilon NAS, but some workloads run on local storage instead, typically for performance reasons.
Using kubernetes seemed like a no-brainer at the time because it would allow me to have a unified control plane for all of my workloads, regardless of where (and on what architecture) they would be running. Since then, I’ve realized that the complexity of managing my services with kubernetes was not justified by the benefits it actually delivered, and so I started migrating away from kubernetes back to a traditional way of managing systems and containers, but many services are still managed as kubernetes containers.
A Samsung SSD failure on the primary development server
My primary development server is named treefort. It is an x86 box with AMD EPYC processors and 256 GB of RAM. It had a 3-way RAID-1 setup using Linux mdraid on 4TB Samsung 860 EVO SSDs. I use KVM with libvirt to manage various VMs on this server, but most of the server’s resources are dedicated to the treefort environment. This environment acts as a kubernetes worker, and is also the kubernetes controller for the entire cluster.
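For reference, a 3-way mirror like this is created by handing mdadm one partition from each disk; something along these lines, where the device names are illustrative rather than my exact layout:

# mdadm --create /dev/md2 --level=1 --raid-devices=3 /dev/sda3 /dev/sdb3 /dev/sdc3

With three copies of every block, any single drive should be able to fail without data loss, which is exactly why what follows was so surprising.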
Recently I had a stick of RAM fail on treefort. I ordered a replacement stick and had a friend replace it. All seemed well, but then I decided to improve my monitoring so that I could be alerted to any future hardware failures, as having random things crash on the machine due to uncorrected ECC errors is not fun. In the process of implementing this monitoring, I learned that one of the SSDs had fallen out of the RAID.
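The check that caught it is nothing fancy. A cron-driven script roughly like the sketch below (where notify is a placeholder for whatever alerting mechanism you use) is all I mean by “monitoring” here:

#!/bin/sh
# crude mdraid health check: a degraded array shows a missing member
# ("_") in the status field of /proc/mdstat, e.g. [UU_] instead of [UUU]
if grep -q '\[.*_.*\]' /proc/mdstat; then
    notify "mdraid array degraded on $(hostname)"
fi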
I thought it was a little weird that only one of the three drives had failed, so I assumed it was just fallout from the maintenance; perhaps the drive had been reseated when the RAM stick was replaced. As the price of a replacement 4TB Samsung SSD is presently around $700 retail, I thought I would re-add the drive to the array, assuming that if it had actually failed, it would fall out of the array again during the rebuild.
# mdadm --manage /dev/md2 --add /dev/sdb3
mdadm: added /dev/sdb3
I then checked /proc/mdstat, and it reported the array as healthy. I thought nothing of it, though in retrospect I should have found this suspicious: there was no indication of the array being in a recovery state; instead, it was immediately reported as healthy, with all three drives present. Unfortunately, I figured “ok, I guess it’s fine” and left it at that.
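In hindsight, the thing to do at this point would have been to confirm that a rebuild was actually happening, rather than trusting the summary in /proc/mdstat. During a genuine rebuild, mdadm --detail reports the re-added member as a rebuilding spare, and the md sysfs node reports a recovery in progress rather than sitting idle:

# mdadm --detail /dev/md2
# cat /sys/block/md2/md/sync_action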
Silent data corruption
Meanwhile, the filesystem in the treefort environment, which is backed by the local SSD storage for speed reasons, began to silently corrupt itself. Because most of my services, such as my mail server, DNS and network monitoring, are running on other hosts, there wasn’t really any indication that anything was wrong. Things seemed to be basically working fine: I had been compiling kernels all week long as I tested various mitigations for the execve(2) issue. What I didn’t know at the time was that with each kernel compile, I was slowly corrupting the disk more and more.
I was not aware of the data corruption issue until today, when I logged into the treefort environment and fired up nano to finish some work that needed to be resolved this week. That led to a rude surprise:
treefort:~$ nano
Segmentation fault
This worried me: after all, why would nano crash if it was working yesterday and nothing had changed? So, I used apk fix to reinstall nano, which made it work again. At this point I was quite suspicious that something was up with the server, so I immediately killed all the guests running on it and focused on the bare-metal host environment (what we would call the dom0 if we were still using Xen).
I ran e2fsck -f on the treefort volumes and hoped for the best. Instead of a clean bill of health, I got lots of filesystem errors. This still didn’t make any sense to me: I checked the array again, and it was still showing as fully healthy. Accordingly, I decided to run e2fsck -fy on the volumes and let it repair whatever it found. This took out the majority of the volume storing my home directory.
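For what it’s worth, e2fsck can be told to only report what it finds without writing anything back, which is what I should have reached for first; something like the following, where the volume path is a placeholder:

# e2fsck -fn /dev/treefort/home

The -n flag opens the filesystem read-only and answers “no” to every repair prompt, so you can see the scale of the damage (and take an image of the volume) before letting -y loose on it.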
The loss of the kubernetes controller
Kubernetes is a fickle beast: it assumes you have set everything up with redundancy, including, of course, redundant controllers. I found this out the hard way when I took treefort offline and the worker nodes got confused and took the services they were running offline as well, presumably because they were unable to talk to the controller.
Eventually, with some help from friends, I was able to recover enough of the volume to boot the system and bring the controller back up, which restored the services on the workers other than treefort. But much like the data in my home directory, the services that were running on treefort itself are likely permanently lost.
Some thoughts
First of all, it is obvious that I need to upgrade my backup strategy to something other than “I’ll figure it out later”. I plan on packaging Rich Felker’s bakelite tool to do just that.
The other big elephant in the room, of course, is “why weren’t you using ZFS in the first place?” While it is true that Alpine has supported ZFS for years, I’ve been hesitant to use it due to the CDDL licensing. In other words, I chose the mantra about GPL compatibility that has been instilled in me since my GNU/Linux days over pragmatism, and my prize for that decision was this mess. While I think Oracle and the Illumos and OpenZFS contributors should come together to relicense the ZFS codebase under MPLv2 to solve the GPL compatibility problem, I am starting to think that I should care more about having a storage technology I can actually trust.
I’m also quite certain that the issue I hit is a bug in mdraid, but perhaps I am wrong. I am told that there is a dirty bitmap system, and that if all bitmaps are marked clean on both the good pair of drives and the bad drive, it can cause this kind of split-brain issue; but I feel like there should be timestamping on those bitmaps to prevent something like this. It’s better to have an unnecessary rebuild because of clock skew than to go split-brain and have 33% of all reads returning silently corrupted data because one disk is out of sync with the others.
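I haven’t done a proper post-mortem yet, but the relevant state can be inspected per member: mdadm can dump the superblock and the write-intent bitmap for each device, and comparing the events counters between the good drives and the re-added one should show whether they actually disagreed:

# mdadm --examine /dev/sdb3 | grep -i events
# mdadm --examine-bitmap /dev/sdb3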
Nonetheless, my plan is to rebuild treefort with ZFS and SSDs from another vendor. Whatever happened with the Samsung SSDs has made me anxious enough that I don’t want to trust them for continued production use.
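The replacement layout will probably be a three-way ZFS mirror, which is simple enough to set up; something like the following, where the pool name and device paths are placeholders:

# zpool create -o ashift=12 treefort mirror /dev/disk/by-id/ssd0 /dev/disk/by-id/ssd1 /dev/disk/by-id/ssd2

Beyond the mirroring, ZFS checksums every block, so a drive that silently returns stale or corrupt data gets caught (and repaired from the other copies) on read, instead of being trusted the way mdraid trusted that Samsung SSD.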