Or: all that reseating, cleaning, and slot swapping for a kernel parameter fix.
The problem
I have a Samsung 990 PRO 4TB as the boot/rpool NVMe in my homelab server (bastion: ZFS everywhere, MicroVMs, 24/7 uptime). It started randomly disappearing from the system. No warning, no graceful degradation, just gone. dmesg would light up with the kernel giving up on resetting the NVMe controller, followed by I/O errors on every operation that was in flight:
[204263.471182] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
[204283.495338] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
[204283.569360] I/O error, dev nvme0n1, sector 2623470496 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[204283.569365] I/O error, dev nvme0n1, sector 3026384936 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[204283.569369] I/O error, dev nvme0n1, sector 3025352880 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[204283.569369] I/O error, dev nvme0n1, sector 2623481976 op 0x1:(WRITE) flags 0x0 phys_seg 11 prio class 2
[204496.790887] systemd[1]: systemd-timesyncd.service: Watchdog timeout (limit 3min)!
Then swap would start failing because the device backing it was gone:
[248903.580981] Read-error on swap-device (254:1:78162624)
[248903.590087] Read-error on swap-device (254:1:78162632)
[248903.599150] Read-error on swap-device (254:1:78162640)
...
ZFS would notice the drive had vanished and mark the vdev as FAULTED. Game over until reboot.
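If you want to check whether your own box shows the same signature, here is a minimal Python sketch that tallies the three kinds of messages from the logs above. The regexes are written against the exact lines shown here and are my assumption about what to match; kernel message wording can differ between versions. The function names (decode_csts, summarize) are made up for this example. The CSTS bit meanings (bit 0 = ready, bit 1 = controller fatal status) come from the NVMe base specification.

```python
import re
from collections import Counter

# CSTS (Controller Status) bits from the NVMe base specification.
CSTS_RDY = 1 << 0  # controller reports ready
CSTS_CFS = 1 << 1  # controller fatal status

# Patterns modeled on the log lines in this post (assumed, not exhaustive).
RESET_RE = re.compile(
    r"nvme (nvme\d+): Device not ready; aborting \w+, CSTS=(0x[0-9a-fA-F]+)"
)
IO_RE = re.compile(r"I/O error, dev (\w+), sector (\d+) op 0x[0-9a-f]+:\((\w+)\)")
SWAP_RE = re.compile(r"Read-error on swap-device")

SAMPLE = """\
[204263.471182] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
[204283.495338] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
[204283.569360] I/O error, dev nvme0n1, sector 2623470496 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[204283.569369] I/O error, dev nvme0n1, sector 3025352880 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[248903.580981] Read-error on swap-device (254:1:78162624)
"""


def decode_csts(value):
    """Split a raw CSTS value into the two flags we care about here."""
    return {"ready": bool(value & CSTS_RDY), "fatal": bool(value & CSTS_CFS)}


def summarize(lines):
    """Count aborted resets, I/O errors by direction, and swap read errors."""
    counts = Counter()
    csts_values = set()
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            counts["reset_aborted"] += 1
            csts_values.add(int(match.group(2), 16))
            continue
        match = IO_RE.search(line)
        if match:
            counts["io_" + match.group(3).lower()] += 1
            continue
        if SWAP_RE.search(line):
            counts["swap_read_error"] += 1
    return counts, csts_values


counts, csts_values = summarize(SAMPLE.splitlines())
print(dict(counts))
for value in sorted(csts_values):
    print(hex(value), decode_csts(value))
```

To run it against a live system, swap SAMPLE for the output of dmesg (for example by reading sys.stdin and piping dmesg into the script). Aborted resets followed by a burst of I/O errors on the same device is the pattern described above.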