dctrud's Random Road

Occasional unimportant nonsense.

2020-05-17 - A Bad Computer Day

I've had a couple of days away from #100DaysToOffload due to having some bad luck with computers recently, so let's get back on track with an incoherent rant about that :-)

On Friday the Samsung 970 EVO NVMe SSD in the desktop machine I use for work started behaving badly. Heavy I/O on the SSD while creating some new VMs led to errors in 'dmesg' output and Linux deciding the drive needed to be reset. Of course it wouldn't reset properly, and then my '/' partition, which lives on that drive, would become unavailable. There are some well documented issues with NVMe resets on older kernels, but this thing has been completely stable for me up until now, and I've not changed kernel since before the issues started. Everything else points toward a drive failure too: the system stays up and semi-usable since I have '/home' and '/data' on different drives, and the other PCIe devices are working fine, including an InfiniBand card and GPU, so it doesn't really look like a motherboard or PSU issue.
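For anyone chasing similar symptoms, here is a rough sketch of the sort of checks involved. The helper name 'nvme_kernel_lines', the device name '/dev/nvme0', and the use of smartctl (from smartmontools) are all assumptions for illustration, not a record of what was actually run on the day.

```shell
# Filter kernel log text arriving on stdin down to the NVMe-related lines.
# (Hypothetical helper name; it is just a case-insensitive grep.)
nvme_kernel_lines() {
    grep -i 'nvme' || true   # '|| true' so "no matches" is not treated as an error
}

# Typical use, as root. /dev/nvme0 is an assumed device name:
#   dmesg | nvme_kernel_lines
#   smartctl -a /dev/nvme0    # the drive's own health and error log report
```

If the kernel log shows repeated timeouts and resets while the drive's own report also lists errors, that leans further toward the drive itself rather than the kernel.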

Saturday started with me re-seating the NVMe and trying to copy some stuff off. I take backups regularly, but I'm missing a couple of little things since the last one. The drive wasn't having this and started disappearing even more quickly, without much prodding. After a couple of attempts to get the data off, it wouldn't even boot to Linux reliably any more. Not a huge problem, I can reinstall to a spare 1TB HDD instead. It's slow, but it'll work, and the EVO drive is in warranty so I can get a replacement.

At this point a parcel showed up on the door-step that I wasn't expecting until mid-week. It's a used but still very nice Lenovo P700 workstation, with dual Xeon E5-2680 v3 CPUs. This was going to be a compact and quieter alternative to having a used rackmount server to run a lot of VMs on, with access to SR-IOV for proper InfiniBand networking between VMs etc. My work is such that I end up simulating HPC cluster stuff on VMs. It's somewhat easier to do that locally than in the cloud, and I like messing with this stuff outside of work too. I decided that, as my desktop needs a re-install anyway due to the drive issue, I could switch over to having this workstation as my desktop machine.

All is good, the Lenovo P700 is very nice. I haven't had a 'proper' workstation at home before and this one is really well built, very quiet considering what's in it, and very fast. The E5-2680 v3 CPUs are not right up-to-date, but much of the speed gain with newer CPUs now comes through more cores, rather than per-core improvements. A newer CPU can give you (many) more cores, or the same number of cores with a fairly modest speed bump and less power draw. It's easily possible to have 24 cores on a single CPU these days, but for my purposes having multiple CPUs and more PCIe lanes for InfiniBand and 2 types of GPU is beneficial. Plus, a roughly 4-year-old workstation turns out to be great value.

Got Fedora 32 onto the new workstation quickly and it's fun to see 24 cores / 48 threads in 'htop'. Time to move one of the RAID HDD mirrors in my old desktop into the machine. Unfortunately I have 2 RAID mirrors, and moved one drive from each mirror, instead of both drives from one mirror. Ouch. Linux mdraid is fairly forgiving though so it's a case of putting the correct disks in, and then...

# Add back the correct second drive this time
mdadm /dev/md0 -a /dev/sdc1

# Watch it re-sync
watch cat /proc/mdstat

# Kick off a full check to feel a bit happier about things
echo check > /sys/block/md0/md/sync_action
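
Once '/proc/mdstat' shows the check has finished, the kernel leaves a mismatch count in sysfs that is worth a look. A minimal sketch, assuming the array is 'md0' as above; the function name and the optional second argument (an alternative sysfs root, there only so it can be tried against a fake directory) are made up for illustration.

```shell
# Print the mismatch count the kernel recorded for an md array's last check.
#   $1 = array name, e.g. md0
#   $2 = sysfs root (optional, defaults to /sys; hypothetical override for testing)
md_mismatch_count() {
    cat "${2:-/sys}/block/$1/md/mismatch_cnt"
}

# Typical use after the check completes:
#   md_mismatch_count md0
```

A count of 0 is the reassuring answer. A non-zero count on a RAID1 mirror isn't automatically a sign of trouble, but it's a prompt to dig a little further.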

This worked out okay, no data lost, no need to go to my backups. However I did, in this process, manage to break a SATA port on the motherboard of my old desktop as I shuffled around the drives. Locking SATA connectors are good... unless they are very cheap and the release tab doesn't work very well. Then if your motherboard has very cheap SATA ports on it, you might end up pulling one straight off the board, or at least the plastic part of it, as the pins stayed on.

Most of a day lost to messing around, but I have all my stuff in place now.
