zfs
zfs copied to clipboard
ZFS ate my data and hang my kernel
System information
| Type | Version/Name |
|---|---|
| Distribution Name | Arch Linux |
| Distribution Version | rolling release |
| Kernel Version | 6.12.18-1-lts |
| Architecture | x86_64 |
| OpenZFS Version | zfs-2.3.1-1 zfs-kmod-2.3.1-1 |
Timeline
About 2 years ago I bought a single 4 TB WD Red Plus hard disk and created a ZFS pool called "data" using the sudo zpool create -f -o ashift=12 data wwn-diskidlotsofnumbers command. I then proceeded to create 2 encrypted datasets in the pool, and put some important data onto them.
Recently, I bought another one of the same exact disk, and added it to the "data" ZFS pool as a mirror using the sudo zpool attach data wwn-olddiskidlotsofnumbers wwn-newdiskidlotsofnumbers command. After this, ZFS started to resilver my disks. The resilvering was half-way complete when its progress stopped. I waited half an hour, and then tried to open files from my datasets, but the file manager / terminal became unresponsive when doing so, and (if I remember correctly) the entire PC froze up shortly after that, so I had to reboot the system.
Kernel logs from when this happened:
Mar 10 20:02:48 btw kernel: ata4.00: failed command: READ DMA EXT
Mar 10 20:02:48 btw kernel: ata4.00: cmd 25/00:08:80:b7:c0/00:00:d1:01:00/e0 tag 26 dma 4096 in
res 61/04:08:80:b7:c0/00:00:d1:01:00/e0 Emask 0x1 (device error)
Mar 10 20:02:48 btw kernel: ata4.00: status: { DRDY DF ERR }
Mar 10 20:02:48 btw kernel: ata4.00: error: { ABRT }
Mar 10 20:02:48 btw kernel: ata4.00: failed to enable AA (error_mask=0x1)
Mar 10 20:02:48 btw kernel: ata4.00: failed to enable AA (error_mask=0x1)
Mar 10 20:02:48 btw kernel: ata4.00: configured for UDMA/133 (device error ignored)
Mar 10 20:02:48 btw kernel: ata4: EH complete
Mar 10 20:02:48 btw kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Mar 10 20:02:48 btw kernel: ata4.00: irq_stat 0x40000001
Mar 10 20:02:48 btw kernel: ata4.00: failed command: READ DMA EXT
Mar 10 20:02:48 btw kernel: ata4.00: cmd 25/00:08:80:b7:c0/00:00:d1:01:00/e0 tag 27 dma 4096 in
res 61/04:08:80:b7:c0/00:00:d1:01:00/e0 Emask 0x1 (device error)
Mar 10 20:02:48 btw kernel: ata4.00: status: { DRDY DF ERR }
Mar 10 20:02:48 btw kernel: ata4.00: error: { ABRT }
Mar 10 20:02:48 btw kernel: ata4.00: failed to enable AA (error_mask=0x1)
Mar 10 20:02:48 btw kernel: ata4.00: failed to enable AA (error_mask=0x1)
Mar 10 20:02:48 btw kernel: ata4.00: configured for UDMA/133 (device error ignored)
Mar 10 20:02:48 btw kernel: ata4: EH complete
Mar 10 20:02:48 btw kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Mar 10 20:02:48 btw kernel: ata4.00: irq_stat 0x40000001
Mar 10 20:02:48 btw kernel: ata4.00: failed command: READ DMA EXT
Mar 10 20:02:48 btw kernel: ata4.00: cmd 25/00:08:80:b7:c0/00:00:d1:01:00/e0 tag 28 dma 4096 in
res 61/04:08:80:b7:c0/00:00:d1:01:00/e0 Emask 0x1 (device error)
Mar 10 20:02:48 btw kernel: ata4.00: status: { DRDY DF ERR }
Mar 10 20:02:48 btw kernel: ata4.00: error: { ABRT }
Mar 10 20:02:48 btw kernel: ata4.00: failed to enable AA (error_mask=0x1)
Mar 10 20:02:48 btw kernel: ata4.00: failed to enable AA (error_mask=0x1)
Mar 10 20:02:48 btw kernel: ata4.00: configured for UDMA/133 (device error ignored)
Mar 10 20:02:48 btw kernel: sd 3:0:0:0: [sda] tag#28 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Mar 10 20:02:48 btw kernel: sd 3:0:0:0: [sda] tag#28 Sense Key : Illegal Request [current]
Mar 10 20:02:48 btw kernel: sd 3:0:0:0: [sda] tag#28 Add. Sense: Unaligned write command
Mar 10 20:02:48 btw kernel: sd 3:0:0:0: [sda] tag#28 CDB: Read(16) 88 00 00 00 00 01 d1 c0 b7 80 00 00 00 08 00 00
Mar 10 20:02:48 btw kernel: I/O error, dev sda, sector 7814035328 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Mar 10 20:02:48 btw kernel: Buffer I/O error on dev sda9, logical block 2032, async page read
Mar 10 20:02:48 btw kernel: ata4: EH complete
Mar 10 20:02:48 btw kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Mar 10 20:02:48 btw kernel: ata4.00: irq_stat 0x40000001
Mar 10 20:02:48 btw kernel: ata4.00: failed command: READ DMA EXT
Mar 10 20:02:48 btw kernel: ata4.00: cmd 25/00:08:80:77:c0/00:00:d1:01:00/e0 tag 12 dma 4096 in
res 61/04:08:80:77:c0/00:00:d1:01:00/e0 Emask 0x1 (device error)
Mar 10 20:02:48 btw kernel: ata4.00: status: { DRDY DF ERR }
Mar 10 20:02:48 btw kernel: ata4.00: error: { ABRT }
Mar 10 20:02:48 btw kernel: ata4.00: failed to enable AA (error_mask=0x1)
Mar 10 20:02:48 btw kernel: ata4.00: failed to enable AA (error_mask=0x1)
Mar 10 20:02:48 btw kernel: ata4.00: configured for UDMA/133 (device error ignored)
Mar 10 20:02:48 btw kernel: sd 3:0:0:0: [sda] tag#12 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Mar 10 20:02:49 btw kernel: sd 3:0:0:0: [sda] tag#12 Sense Key : Illegal Request [current]
Mar 10 20:02:49 btw kernel: sd 3:0:0:0: [sda] tag#12 Add. Sense: Unaligned write command
Mar 10 20:02:49 btw kernel: sd 3:0:0:0: [sda] tag#12 CDB: Read(16) 88 00 00 00 00 01 d1 c0 77 80 00 00 00 08 00 00
Mar 10 20:02:49 btw kernel: I/O error, dev sda, sector 7814018944 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
Mar 10 20:02:49 btw kernel: ata4: EH complete
Mar 10 20:02:49 btw kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Mar 10 20:02:49 btw kernel: ata4.00: irq_stat 0x40000001
Mar 10 20:02:49 btw kernel: ata4.00: failed command: READ DMA EXT
Mar 10 20:02:49 btw kernel: ata4.00: cmd 25/00:08:80:77:c0/00:00:d1:01:00/e0 tag 13 dma 4096 in
res 61/04:08:80:77:c0/00:00:d1:01:00/e0 Emask 0x1 (device error)
Mar 10 20:02:49 btw kernel: ata4.00: status: { DRDY DF ERR }
Mar 10 20:02:49 btw kernel: ata4.00: error: { ABRT }
Mar 10 20:02:49 btw kernel: ata4.00: failed to enable AA (error_mask=0x1)
Mar 10 20:02:49 btw kernel: ata4.00: failed to enable AA (error_mask=0x1)
Mar 10 20:02:49 btw kernel: ata4.00: configured for UDMA/133 (device error ignored)
Mar 10 20:02:49 btw kernel: ata4: EH complete
Mar 10 20:02:49 btw kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Mar 10 20:02:49 btw kernel: ata4.00: irq_stat 0x40000001
Mar 10 20:02:49 btw kernel: ata4.00: failed command: READ DMA EXT
Mar 10 20:02:49 btw kernel: ata4.00: cmd 25/00:08:80:77:c0/00:00:d1:01:00/e0 tag 15 dma 4096 in
res 61/04:08:80:77:c0/00:00:d1:01:00/e0 Emask 0x1 (device error)
Mar 10 20:02:49 btw kernel: ata4.00: status: { DRDY DF ERR }
Mar 10 20:02:49 btw kernel: ata4.00: error: { ABRT }
Mar 10 20:02:49 btw kernel: ata4.00: failed to enable AA (error_mask=0x1)
Mar 10 20:02:49 btw kernel: ata4.00: failed to enable AA (error_mask=0x1)
Mar 10 20:02:49 btw kernel: ata4.00: configured for UDMA/133 (device error ignored)
Mar 10 20:02:49 btw kernel: ata4: EH complete
Mar 10 20:02:49 btw kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Mar 10 20:02:49 btw kernel: ata4.00: irq_stat 0x40000001
Mar 10 20:02:49 btw kernel: ata4.00: failed command: READ DMA EXT
Mar 10 20:02:49 btw kernel: ata4.00: cmd 25/00:08:80:77:c0/00:00:d1:01:00/e0 tag 13 dma 4096 in
res 61/04:08:80:77:c0/00:00:d1:01:00/e0 Emask 0x1 (device error)
Mar 10 20:02:49 btw kernel: ata4.00: status: { DRDY DF ERR }
Mar 10 20:02:49 btw kernel: ata4.00: error: { ABRT }
Mar 10 20:02:49 btw kernel: ata4.00: failed to enable AA (error_mask=0x1)
Mar 10 20:02:49 btw kernel: ata4.00: failed to enable AA (error_mask=0x1)
Mar 10 20:02:49 btw kernel: ata4.00: configured for UDMA/133 (device error ignored)
Mar 10 20:02:49 btw kernel: ata4: EH complete
Mar 10 20:02:49 btw kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Mar 10 20:02:49 btw kernel: ata4.00: irq_stat 0x40000001
Mar 10 20:02:49 btw kernel: ata4.00: failed command: READ DMA EXT
Mar 10 20:02:49 btw kernel: ata4.00: cmd 25/00:08:80:77:c0/00:00:d1:01:00/e0 tag 14 dma 4096 in
res 61/04:08:80:77:c0/00:00:d1:01:00/e0 Emask 0x1 (device error)
Mar 10 20:02:49 btw kernel: ata4.00: status: { DRDY DF ERR }
Mar 10 20:02:49 btw kernel: ata4.00: error: { ABRT }
Mar 10 20:02:49 btw kernel: ata4.00: failed to enable AA (error_mask=0x1)
Mar 10 20:02:49 btw kernel: ata4.00: failed to enable AA (error_mask=0x1)
Mar 10 20:02:49 btw kernel: ata4.00: configured for UDMA/133 (device error ignored)
Mar 10 20:02:49 btw kernel: ata4: EH complete
Mar 10 20:02:49 btw kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Mar 10 20:02:49 btw kernel: ata4.00: irq_stat 0x40000001
Mar 10 20:02:49 btw kernel: ata4.00: failed command: READ DMA EXT
Mar 10 20:02:49 btw kernel: ata4.00: cmd 25/00:08:80:77:c0/00:00:d1:01:00/e0 tag 15 dma 4096 in
res 61/04:08:80:77:c0/00:00:d1:01:00/e0 Emask 0x1 (device error)
Mar 10 20:02:49 btw kernel: ata4.00: status: { DRDY DF ERR }
Mar 10 20:02:49 btw kernel: ata4.00: error: { ABRT }
Mar 10 20:02:49 btw kernel: ata4.00: failed to enable AA (error_mask=0x1)
Mar 10 20:02:49 btw kernel: ata4.00: failed to enable AA (error_mask=0x1)
Mar 10 20:02:49 btw kernel: ata4.00: configured for UDMA/133 (device error ignored)
Mar 10 20:02:49 btw kernel: ata4: EH complete
Mar 10 20:02:49 btw kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Mar 10 20:02:49 btw kernel: ata4.00: irq_stat 0x40000001
Mar 10 20:02:49 btw kernel: ata4.00: failed command: READ DMA EXT
Mar 10 20:02:49 btw kernel: ata4.00: cmd 25/00:08:80:77:c0/00:00:d1:01:00/e0 tag 16 dma 4096 in
res 61/04:08:80:77:c0/00:00:d1:01:00/e0 Emask 0x1 (device error)
Mar 10 20:02:49 btw kernel: ata4.00: status: { DRDY DF ERR }
Mar 10 20:02:49 btw kernel: ata4.00: error: { ABRT }
Mar 10 20:02:49 btw kernel: ata4.00: failed to enable AA (error_mask=0x1)
Mar 10 20:02:49 btw kernel: ata4.00: failed to enable AA (error_mask=0x1)
Mar 10 20:02:49 btw kernel: ata4.00: configured for UDMA/133 (device error ignored)
Mar 10 20:02:49 btw kernel: sd 3:0:0:0: [sda] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Mar 10 20:02:49 btw kernel: sd 3:0:0:0: [sda] tag#16 Sense Key : Illegal Request [current]
Mar 10 20:02:49 btw kernel: sd 3:0:0:0: [sda] tag#16 Add. Sense: Unaligned write command
Mar 10 20:02:49 btw kernel: sd 3:0:0:0: [sda] tag#16 CDB: Read(16) 88 00 00 00 00 01 d1 c0 77 80 00 00 00 08 00 00
Mar 10 20:02:49 btw kernel: I/O error, dev sda, sector 7814018944 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Mar 10 20:02:49 btw kernel: Buffer I/O error on dev sda1, logical block 976752112, async page read
Mar 10 20:02:49 btw kernel: ata4: EH complete
Describe the problem you're observing
After the previously described disaster happened, unless I physically remove/disconnect the new hard disk, the OS doesn't fully load and I get this message, which causes the kernel thread to hang:
VERIFY0(dmu_read(os, vim->vim_object, 0, map_size, vim->vim_entries, DMU_READ_PREFETCH)) failed (0 == 52)
PANIC at vdev_indirect_mapping.c:363:vdev_indirect_mapping_open()
If I do physically remove/disconnect the new hard drive, and only keep my old hard drive, the system loads, but I get the following error on zfs import:
cannot import 'data': one or more devices is currently unavailable
cachefile import failed, retrying
cannot import '(null)': no such pool available
Note: /dev/sdb in the logs refers to the new disk
Reconnect the new hard drive (I have hot-plugging set up) and trying to import the pool will result in the same VERIFY0... message in the kernel logs and the kernel thread hanging.
What I tried to debug the issue
- I connected the disks (both individually and together) to another PC
- I tried using Ubuntu 24.04 on both PCs and in a QEMU VM with the SATA controller passed through
In both cases, the results were exactly the same as on my Arch Linux setup.
I also tried to use ddrescue to clone my old disk to a raw .img file. It seems like it wasn't able to read anything from the disk, however I'm not 100% sure I wasn't using it wrong or misinterpreting its output.
However, I was able to clone the disk to an .img file on Windows.
On Linux, I then proceeded to mount the .img file in the Ubuntu 24.04 virtual machine, and run zfs import. It didn't work. Then, I passed through the SATA controller handling my new hard disk to the VM, and tried to import again. This resulted in my VM experiencing the infamous VERIFY0 kernel hanging problem. I also tried the same with FreeBSD, but in the case of FreeBSD, instead of the kernel hanging, the VM rebooted.
Describe how to reproduce the problem
I can reproduce this issue every time, however I couldn't find anyone else with the exact same problem as me, so it is likely something specific. If needed, I can let a ZFS developer connect to my VM to debug this issue.
Final words
Some of the important data in the ZFS pool was backed up elsewhere, but some of it (the relatively new stuff) wasn't yet. I hope my files are recoverable, as they shouldn't have been overwritten and still physically be stored on the disk. In fact, I know this isn't a physical failure of the disks, since I can see empty 0 portions and data portions in the .img file I took using a hex editor. So from my understanding, this must be a bug in ZFS, as it shouldn't be okay for a kernel module to cause a kernel thread to hang or the OS to reboot.
Hate to be that person, but DRDY DF ERR messages would indicate a hardware issue.
I agree. Looks like a hardware issue. Could even be the SATA cable.
It's not the SATA cable, nor the SATA controller, as I tested it with multiple cables on multiple PCs.
Could a hardware issue have caused the data become corrupted while resilvering? Certainly. But I don't think this is a hardware issue in the sense that there's anything wrong with the disks, since 1) the chances of 2 (relatively) new disks failing at the same time is low 2) Windows was able to clone the disk into an .img file. Linux and FreeBSD hang/reboot even when reading from the image file, which shouldn't happen, even if the data itself is corrupted (which it very well might be).
Where do you see both disks failed? Errors in the kernel log you posted concern only the first disk.
I have no doubt that this is hardware related. The ata errors you are seeing are low level. In my opinion this can never be done by a filesystem.
100% this is a drive failure of the original drive (or cable/power issue, but all signs lead to the drive itself from what you've posted). No filesystem can cause errors like that, they are low level command failures to/from the drive.
How sure are you that the new drive is a Red Plus? Because the symptoms are consistent with a WD Red SMR drive, see: #10214.