
CKSUM and WRITE errors when receiving snapshots or scrubbing (2.2.4, LUKS)

Open delan opened this issue 1 year ago • 38 comments

This is probably still #15533, but making a separate issue just in case.

System information

Type                  Version/Name
Distribution Name     NixOS
Distribution Version  23.11
Kernel Version        6.1.65
Architecture          x86_64
OpenZFS Version       zfs-2.2.2-1, zfs-kmod-2.2.2-1

Describe the problem you're observing

When receiving snapshots, I get hundreds of WRITE errors and many CKSUM errors, even after updating to 2.2.2. While all of my leaf vdevs are on LUKS, the errors only happen on vdevs that are connected to one of my mpt3sas controllers, never on vdevs connected to the onboard ahci.

Before updating to 2.2.2, I was briefly on 2.2.1, and I got many WRITE and CKSUM errors from normal writes (probably qbittorrent). More context about my situation, starting around a week ago:

  1. Migrated pool “ocean” from FreeBSD to Linux
    • from FreeBSD 13.2-RELEASE (zfs-2.1.9-FreeBSD_g92e0d9d18), with all leaf vdevs on geli(4)
    • to NixOS 23.05 (zfs 2.1.12), with all leaf vdevs on LUKS2
    • technically a new pool now (split + destroy + create + send/recv + destroy + attach)
  2. Upgraded to NixOS 23.11 (zfs 2.2.1)
  3. Lots of WRITE and CKSUM errors from normal writes
  4. Updated to zfs 2.2.2
  5. The export/reboot/import started a resilver with an ETA of over 10 days(!)
  6. Managed to stop it by setting zfs_no_scrub_io, detaching, attaching, and unsetting (see the sketch just after this list)
  7. Waited a day or so for the sequential resilver and scrub to finish, no errors!
  8. Started receiving snapshots from a machine running NixOS 23.05 (zfs-2.1.12-1)
  9. Lots of WRITE and CKSUM errors from the receive into “ocean”
  10. Started receiving snapshots from pool “cuffs” on the same machine (zfs-2.2.2-1)
  11. Lots of WRITE and CKSUM errors from the receive into “ocean”
  12. Started sending snapshots to pool “cuffs” on the same machine (zfs-2.2.2-1)
  13. No errors from the receive into “cuffs”
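
For reference, the zfs_no_scrub_io trick in step 6 amounted to roughly the following (a sketch; the vdev names here are placeholders, not necessarily the pair I actually touched):

$ echo 1 | sudo tee /sys/module/zfs/parameters/zfs_no_scrub_io           # stop scrub/resilver I/O
$ sudo zpool detach ocean ocean4x1                                       # drop one side of the mirror
$ sudo zpool attach -s ocean /dev/mapper/ocean4x0 /dev/mapper/ocean4x1   # reattach it, sequential rebuild
$ echo 0 | sudo tee /sys/module/zfs/parameters/zfs_no_scrub_io           # back to normal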

Describe how to reproduce the problem

I have backups, so I’m happy to split my pool and create a test pool over half as many disks if needed.

Include any warning/errors/backtraces from the system logs

The leaf vdev names look funky in these logs, because the pool was imported with -d /dev/disk/by-id but the vdevs were reattached from /dev/mapper:

The rest look nicer because I reimported “ocean” with -d /dev/mapper:

My disks should be fine, and the UDMA CRC error counts, while non-zero, are stable:

$ smart. /dev/sd?
sda          sdb          sdc          sdd          sde          sdf          sdg          sdh          sdi          sdj          sdk          sdl          sdm          sdn
                          ocean2x0     ocean3x0     ocean1x0     ocean0x0     star1x0      star0x0      ocean4x0     ocean4x1     ocean3x1     ocean2x2     ocean1x1     ocean0x1
PASSED       PASSED       PASSED       PASSED       PASSED       PASSED       PASSED       PASSED       PASSED       PASSED       PASSED       PASSED       PASSED       PASSED       overall-health
0            0            0            0            0            0            0            0            0            0            0            0            0            0            Reallocated_Sector_Ct
54468        41829        59659        59644        59653        59428        19000        19000        1425         1425         59663        19964        59638        59225        Power_On_Hours
                          0            0            0            0            0            0            0            0            0            0            0            0            Reallocated_Event_Count
0            0            0            0            0            0            0            0            0            0            0            0            0            0            Current_Pending_Sector
                          0            0            0            0            0            0            0            0            0            0            0            0            Offline_Uncorrectable
162071       0            2495         158          1764         2515         0            0            0            0            944          0            199          0            UDMA_CRC_Error_Count
                          227527       242541       239113       9108         2108         2065         53           53           246864       246          241566       9002         Load_Cycle_Count
37           39           42           44           44           42           49           51           46           47           39           40           42           43           Temperature_Celsius

Aside from the special mirror /dev/mapper/oceanSx{0,1}, half of each mirror is connected to one mpt3sas (0000:01:00.0) and the other half on another mpt3sas (0000:06:00.0):

$ ls -l /sys/block/sd?
lrwxrwxrwx 1 root root 0 Dec  7 12:29 /sys/block/sda -> ../devices/pci0000:00/0000:00:1f.2/ata3/host3/target3:0:0/3:0:0:0/block/sda
lrwxrwxrwx 1 root root 0 Dec  7 12:29 /sys/block/sdb -> ../devices/pci0000:00/0000:00:1f.2/ata2/host2/target2:0:0/2:0:0:0/block/sdb
lrwxrwxrwx 1 root root 0 Dec  7 12:29 /sys/block/sdc -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/block/sdc
lrwxrwxrwx 1 root root 0 Dec  7 12:29 /sys/block/sdd -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/host0/port-0:1/end_device-0:1/target0:0:1/0:0:1:0/block/sdd
lrwxrwxrwx 1 root root 0 Dec  7 12:29 /sys/block/sde -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/host0/port-0:2/end_device-0:2/target0:0:2/0:0:2:0/block/sde
lrwxrwxrwx 1 root root 0 Dec  7 12:29 /sys/block/sdf -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/host0/port-0:3/end_device-0:3/target0:0:3/0:0:3:0/block/sdf
lrwxrwxrwx 1 root root 0 Dec  7 12:29 /sys/block/sdg -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/host0/port-0:4/end_device-0:4/target0:0:4/0:0:4:0/block/sdg
lrwxrwxrwx 1 root root 0 Dec  7 12:29 /sys/block/sdh -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/host0/port-0:5/end_device-0:5/target0:0:5/0:0:5:0/block/sdh
lrwxrwxrwx 1 root root 0 Dec  7 12:29 /sys/block/sdi -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/host0/port-0:6/end_device-0:6/target0:0:6/0:0:6:0/block/sdi
lrwxrwxrwx 1 root root 0 Dec  7 12:29 /sys/block/sdj -> ../devices/pci0000:00/0000:00:1c.4/0000:06:00.0/host7/port-7:0/end_device-7:0/target7:0:0/7:0:0:0/block/sdj
lrwxrwxrwx 1 root root 0 Dec  7 12:29 /sys/block/sdk -> ../devices/pci0000:00/0000:00:1c.4/0000:06:00.0/host7/port-7:1/end_device-7:1/target7:0:1/7:0:1:0/block/sdk
lrwxrwxrwx 1 root root 0 Dec  7 12:29 /sys/block/sdl -> ../devices/pci0000:00/0000:00:1c.4/0000:06:00.0/host7/port-7:2/end_device-7:2/target7:0:2/7:0:2:0/block/sdl
lrwxrwxrwx 1 root root 0 Dec  7 12:29 /sys/block/sdm -> ../devices/pci0000:00/0000:00:1c.4/0000:06:00.0/host7/port-7:3/end_device-7:3/target7:0:3/7:0:3:0/block/sdm
lrwxrwxrwx 1 root root 0 Dec  7 12:29 /sys/block/sdn -> ../devices/pci0000:00/0000:00:1c.4/0000:06:00.0/host7/port-7:4/end_device-7:4/target7:0:4/7:0:4:0/block/sdn

$ lspci -tvnn
-[0000:00]-+-00.0  Intel Corporation Xeon E3-1200 v3 Processor DRAM Controller [8086:0c08]
           +-01.0-[01]----00.0  Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [1000:0072]
           +-14.0  Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI [8086:8c31]
           +-16.0  Intel Corporation 8 Series/C220 Series Chipset Family MEI Controller #1 [8086:8c3a]
           +-16.1  Intel Corporation 8 Series/C220 Series Chipset Family MEI Controller #2 [8086:8c3b]
           +-1a.0  Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #2 [8086:8c2d]
           +-1c.0-[02-03]----00.0-[03]----00.0  ASPEED Technology, Inc. ASPEED Graphics Family [1a03:2000]
           +-1c.2-[04]----00.0  Intel Corporation I210 Gigabit Network Connection [8086:1533]
           +-1c.3-[05]----00.0  Intel Corporation I210 Gigabit Network Connection [8086:1533]
           +-1c.4-[06]----00.0  Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [1000:0072]
           +-1c.6-[07]----00.0  Intel Corporation I210 Gigabit Network Connection [8086:1533]
           +-1c.7-[08]----00.0  Intel Corporation I210 Gigabit Network Connection [8086:1533]
           +-1d.0  Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #1 [8086:8c26]
           +-1f.0  Intel Corporation C224 Series Chipset Family Server Standard SKU LPC Controller [8086:8c54]
           +-1f.2  Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] [8086:8c02]
           +-1f.3  Intel Corporation 8 Series/C220 Series Chipset Family SMBus Controller [8086:8c22]
           \-1f.6  Intel Corporation 8 Series Chipset Family Thermal Management Controller [8086:8c24]

delan avatar Dec 07 '23 04:12 delan

Are you getting any errors in dmesg from the controller when this happens? Because if this is that bug, then I would expect the controller or disks to be complaining a lot.

rincebrain avatar Dec 07 '23 12:12 rincebrain

No controller or SCSI or block errors, only zio errors: dmesg.txt

delan avatar Dec 08 '23 01:12 delan

That seems like it's coming back from LUKS without going out to the disk, then, at which point, it sounds like #15533, I guess.

rincebrain avatar Dec 08 '23 03:12 rincebrain

The only code in dm-crypt.c that returns EIO is two instances of:

	/* Reject unexpected unaligned bio. */
	if (unlikely(bv_in.bv_len & (cc->sector_size - 1)))
		return -EIO;

While not a smoking gun, that's a good enough reason to believe it is indeed #15533, hopefully to be solved by #15588.

robn avatar Dec 08 '23 06:12 robn

I figured downgrading zfs to 2.1.x would help buy me some time until #15588 lands, but surprisingly I still see this issue going as far back as zfs 2.1.9.

distro / linux kernel / zfs{,-kmod} version / affected?

  • NixOS 23.11 (https://github.com/NixOS/nixpkgs/commit/5de0b32be6e85dc1a9404c75131316e4ffbc634c), Linux 6.1.65, zfs{,-kmod}-2.2.2-1: affected
  • NixOS 23.11 (https://github.com/NixOS/nixpkgs/commit/0b3d618173114c64ab666f557504d6982665d328), Linux 6.1.69, zfs{,-kmod}-2.2.2-1: affected
  • NixOS 23.11 (https://github.com/NixOS/nixpkgs/commit/0b3d618173114c64ab666f557504d6982665d328), Linux 6.1.69, zfs{,-kmod}-2.1.14-1: affected
  • NixOS 23.11 (https://github.com/NixOS/nixpkgs/commit/0b3d618173114c64ab666f557504d6982665d328) with kernel + zfs from https://github.com/NixOS/nixpkgs/commit/2447a25f908c17964f224c839b5a5f6814265b6b, Linux 6.1.64, zfs{,-kmod}-2.1.13-1: affected
  • NixOS 23.11 (https://github.com/NixOS/nixpkgs/commit/0b3d618173114c64ab666f557504d6982665d328) with kernel + zfs from https://github.com/delan/nixpkgs/commit/970c5fb08bc7d137c74418ac04d386f3d77c9602, Linux 6.1.69, zfs{,-kmod}-2.1.12-1: affected
  • NixOS 23.11 (https://github.com/NixOS/nixpkgs/commit/0b3d618173114c64ab666f557504d6982665d328) with kernel + zfs from https://github.com/delan/nixpkgs/commit/70f148dd685f4e49d9e5d7176f38be8fe5702d9c, Linux 6.1.69, zfs{,-kmod}-2.1.9-1: affected

I suspect the upgrade from NixOS 23.05 has changed some confounding variable like the kernel version or downstream patches. I’ll test with other distro versions and maybe even other distros.

— 2023-12-26


distro / linux kernel / zfs{,-kmod} version / affected?

  • NixOS 23.05 (https://github.com/NixOS/nixpkgs/commit/7790e078f8979a9fcd543f9a47427eeaba38f268), Linux 6.1.69, zfs{,-kmod}-2.1.14-1: affected
  • NixOS 23.05 (https://github.com/NixOS/nixpkgs/commit/18784aac1013da9b442adf29b6c7c228518b5d3f), Linux 6.1.44, zfs{,-kmod}-2.1.12-1: affected

Both the tip of NixOS 23.05 and the commit I was running before the upgrade in step 2 are affected. That's not really contradictory, since I didn't start receiving backups until step 8, but it smells weird to me given that step 1 also involved receiving snapshots without any problems.

— 2023-12-28

delan avatar Dec 26 '23 06:12 delan

I wanted to eliminate the possibility of a problem with pool “ocean” specifically, like the pool being somehow tainted by my testing, since that seemed more likely than the idea that #15533 affects zfs 2.1.x but no one before me knew.

In the end I wasn’t able to reproduce the problem on any test pool with the spare disks I had nearby, but I did discover a workaround that will at least help me get my backups going again in the meantime.

Results of non-incremental sends:

In short, you can work around the problem by sticking a losetup -f between the zpool and luks2. The loop device seems to be more forgiving of unaligned writes.

$ for i in ocean{0x0,0x1,1x0,1x1,2x0,2x2,3x0,3x1,4x0,4x1,Sx0,Sx1,.arc}; do sudo cryptsetup open -d path/to/passphrase {/dev/disk/by-partlabel/,}$i; done

$ for i in ocean{0x0,0x1,1x0,1x1,2x0,2x2,3x0,3x1,4x0,4x1,Sx0,Sx1,.arc}; do sudo losetup -f --show /dev/mapper/$i; done

$ set --; for i in /dev/loop{0..12}; do set -- "$@" -d $i; done; echo "$@"
-d /dev/loop0 -d /dev/loop1 -d /dev/loop2 -d /dev/loop3 -d /dev/loop4 -d /dev/loop5 -d /dev/loop6 -d /dev/loop7 -d /dev/loop8 -d /dev/loop9 -d /dev/loop10 -d /dev/loop11 -d /dev/loop12

$ sudo zpool import "$@" ocean

More testing to reproduce the problem on another pool is possible, but for now I need a break :)

delan avatar Dec 28 '23 08:12 delan

FYI, 2.2.4 just shipped, with #15588 and followup patches included. If you are still having this problem, you might try setting zfs_vdev_disk_classic=0 in your zfs module parameters and seeing if that helps. If you do try this, please report back with your results, as our hope is to make this the default in the future.
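
For example, one way to set it persistently (the file name is just a convention; use whatever your distro expects for module options) and to check what the loaded module is currently using:

# set at module load time
$ echo 'options zfs zfs_vdev_disk_classic=0' | sudo tee /etc/modprobe.d/zfs.conf
# confirm what the running module is using
$ cat /sys/module/zfs/parameters/zfs_vdev_disk_classic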

robn avatar May 03 '24 02:05 robn

Thanks! I’ll test 2.2.4 and/or zfs_vdev_disk_classic=0 and let you know how I go :D

delan avatar May 03 '24 02:05 delan

No luck on zfs 2.2.4 alone:

[screenshot]

Did you mean zfs_vdev_disk_classic=1? 0 is the default and means new, right?

delan avatar May 10 '24 15:05 delan

@delan you'll need to set zfs_vdev_disk_classic=0 to test this. 2.2.4 defaults to 1 which means continue to use the previous code.

behlendorf avatar May 10 '24 23:05 behlendorf

@delan see your installed man 4 zfs for more info, or eebf00be.

robn avatar May 10 '24 23:05 robn

Thanks! So far, I’m unable to reproduce the original failure — CKSUM and WRITE errors when receiving snapshots — under zfs 2.2.4 with zfs_vdev_disk_classic=0, or even when I go back to the default zfs_vdev_disk_classic=1!

Last night was a bit more complicated, because although I received a few snapshots, I upgraded in the middle of rearranging and expanding my pool. Here’s the full timeline:

  • zpool history ocean | rg '^2024-05-(09|10|11)'
  • journalctl -S '2024-04-12 11:08:07' -t zed -t kernel
  • May 09 19:56:47, booted zfs 2.2.3 with losetup workaround
  • May 09 20:19:04, zpool attach -s ocean loop109(ocean4x0) loop113(ocean4x2)
  • May 09 20:47:10, booted zfs 2.2.4 without losetup workaround, during the resilver phase
  • May 10 08:54:33, zed class=resilver_finish pool='ocean', class=scrub_start pool='ocean'
  • May 10 15:26:10, several zfs recv from one machine, no errors(!)
  • May 10 15:29:39, zpool detach ocean ocean4x1
  • May 10 15:45:36, zpool add ocean mirror mapper/ocean5x0 mapper/ocean5x1
  • May 10 19:39:24, first zio error pool=ocean vdev=/dev/mapper/ocean4x0 error=5 type=1
  • May 10 20:03:46, last zio error pool=ocean vdev=/dev/mapper/ocean0x0 error=5 type=1
  • May 10 21:00:02, several zfs recv from three machines, no errors(!)
  • May 10 22:00:03, several zfs recv from one machine, no errors(!)
  • May 11 00:01:36, zed class=scrub_finish pool='ocean'

Interestingly (to me), the errors didn’t happen anywhere near when I received snapshots, only in a half-hour window near the end of the scrub phase of the zpool attach -s. And this time, there were read errors, not just write errors.

delan avatar May 11 '24 04:05 delan

I can reproduce the new failure: READ/WRITE/CKSUM errors when scrubbing, where all of the read and write errors are EIO. Haven’t tried zfs_vdev_disk_classic=1 yet. Should I create a separate issue for that and close this one?

delan avatar May 13 '24 02:05 delan

READ/WRITE/CKSUM errors reproducible when scrubbing, where all of the read and write errors are EIO:

$ journalctl -t zed -t kernel | rg ' kernel: zio | kernel: Linux version |\]: ZFS Event Daemon | class=(vdev|resilver|scrub)_'

Should I create a separate issue for that and close this one, since receiving snapshots seems to work for me now?

delan avatar May 14 '24 03:05 delan

I'm content to keep it here, since we're already here and I'm not sure that it's unrelated yet.

So this is interesting! If I'm reading the logs right, the change from classic to new gets rid of most of the noise, and what remains is scrub/repair IO that appears identical in both modes (same flags, offsets, sizes, etc). The small IO sizes suggest to me it's not directly to do with splitting or misalignment - those seem way too small to trip those issues.

I assume there's nothing more useful in the kernel log, like errors from intermediate layers?

Does your trick of inserting a loop device still work?

Regardless of whether or not the loop helps, could you describe the layout of your devices, DM layers, etc, and then the pool on top? Possibly that's just the output from lsblk -t and zdb -C -e ocean. Mostly what I'm looking to understand is the start, size and block size of each layer in the stack, and the kind (driver) of each. Then I can read through the driver code and hopefully figure out where the IO went and what happened at each step.

(If the loop does help, then that output without and with would be great too!).

Thanks! Sorry this has gone on so long. I do know it sucks when you're the one person with the setup that tickles things just so.

robn avatar May 14 '24 11:05 robn

So this is interesting! If I'm reading the logs right, the change from classic to new gets rid of most of the noise, and what remains is scrub/repair IO that appears identical in both modes (same flags, offsets, sizes, etc). The small IO sizes suggest to me it's not directly to do with splitting or misalignment - those seem way too small to trip those issues.

I accidentally included the zfs_vdev_disk_classic=0 logs (May 13 10:31:51 onwards) in the zfs_vdev_disk_classic=1 logs above, so they may have been a bit misleading. With the logs fixed (zfs 2.2.4, zfs_vdev_disk_classic=0):

$ < scrub.classic.txt rg -o 'offset=[^ ]+' | sort -u > offsets.classic.txt
$ < scrub.new.txt rg -o 'offset=[^ ]+' | sort -u > offsets.new.txt
$ < scrub.classic.txt rg -o 'offset=[^ ]+ size=[^ ]+ flags=[^ ]+' | sort -u > identical.classic.txt
$ < scrub.new.txt rg -o 'offset=[^ ]+ size=[^ ]+ flags=[^ ]+' | sort -u > identical.new.txt

$ comm -23 offsets.classic.txt offsets.new.txt | wc -l
1021    (offsets found in classic only)
$ comm -12 offsets.classic.txt offsets.new.txt | wc -l
48      (offsets found in both)
$ comm -13 offsets.classic.txt offsets.new.txt | wc -l
0       (offsets found in new only)

$ comm -23 identical.classic.txt identical.new.txt | wc -l
1112    ((offset,size,flag)s found in classic only)
$ comm -12 identical.classic.txt identical.new.txt | wc -l
50      ((offset,size,flag)s found in both)
$ comm -13 identical.classic.txt identical.new.txt | wc -l
46      ((offset,size,flag)s found in new only)

I assume there's nothing more useful in the kernel log, like errors from intermediate layers?

None from what I can see. Here are my unfiltered zed and kernel logs (journalctl -t zed -t kernel), though like the logs above, the “classic” logs include some surrounding boots for resilvering and changing settings:

Does your trick of inserting a loop device still work?

Testing that now with zfs_vdev_disk_classic=0, I’ll keep you posted.

Regardless of whether or not the loop helps, could you describe the layout of your devices, DM layers, etc, and then the pool on top? Possibly that's just the output from lsblk -t and zdb -C -e ocean. Mostly what I'm looking to understand is the start, size and block size of each layer in the stack, and the kind (driver) of each. Then I can read through the driver code and hopefully figure out where the IO went and what happened at each step.

zfs 2.2.4, zfs_vdev_disk_classic=0, with losetup workaround:

  • zdb -C -e ocean
  • lsblk --bytes -to +START,SIZE,TYPE,MODEL
  • cryptsetup status
  • losetup -l
  • 0000:01:00.0 Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [1000:0072]
    • disks for one half of each mirror vdev (ocean0x0, ocean1x0, ocean2x0, ocean3x0, ocean4x0, ocean5x0)
    • disks are 4096 min-io, 4096 phy-sec, 512 log-sec
      • gpt partitions have offset 40×512 = 20480 (4TB disks) or 2048×512 = 1048576 (8TB disks)
        • luks devs have block size 4096, offset 32768×512 = 16777216
          • loop devs have block size 512, offset 0
  • 0000:06:00.0 Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [1000:0072]
    • disks for other half of each mirror vdev (ocean0x1, ocean1x1, ocean2x2, ocean3x1, ocean4x2, ocean5x1)
    • layout as above
  • 0000:00:1f.2 Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] [8086:8c02]
    • disks for special vdevs (oceanSx0, oceanSx1)
    • disks are 512 min-io, 512 phy-sec, 512 log-sec
      • gpt partitions have different offsets, but MiB-aligned
        • luks devs have block size 512, offset 32768×512 = 16777216
          • loop devs have block size 512, offset 0
Details for each layer (zfs-15646-tables.sh)
leaf      loop dev  losetup “log-sec”  losetup “offset”  lsblk “min-io”  lsblk “phy-sec”  lsblk “log-sec”  lsblk “size”
ocean0x0 loop100 512 0 512 512 512 4000770211840
ocean1x0 loop102 512 0 512 512 512 4000770211840
ocean2x0 loop104 512 0 512 512 512 4000770211840
ocean3x0 loop106 512 0 512 512 512 4000770211840
ocean4x0 loop108 512 0 512 512 512 8001545043968
ocean5x0 loop110 512 0 512 512 512 8001545043968
ocean0x1 loop101 512 0 512 512 512 4000770211840
ocean1x1 loop103 512 0 512 512 512 4000770211840
ocean2x2 loop105 512 0 512 512 512 4000770211840
ocean3x1 loop107 512 0 512 512 512 4000770211840
ocean4x2 loop109 512 0 512 512 512 8001545043968
ocean5x1 loop111 512 0 512 512 512 8001545043968
oceanSx0 loop112 512 0 512 512 512 77292634112
oceanSx1 loop113 512 0 512 512 512 77292634112
leaf      luks dev  cryptsetup “sector size”  cryptsetup “offset”  cryptsetup “size”
ocean0x0 dm-3 4096 32768 7814004320
ocean1x0 dm-5 4096 32768 7814004320
ocean2x0 dm-7 4096 32768 7814004320
ocean3x0 dm-9 4096 32768 7814004320
ocean4x0 dm-11 4096 32768 15628017664
ocean5x0 dm-13 4096 32768 15628017664
ocean0x1 dm-4 4096 32768 7814004320
ocean1x1 dm-6 4096 32768 7814004320
ocean2x2 dm-8 4096 32768 7814004320
ocean3x1 dm-10 4096 32768 7814004320
ocean4x2 dm-12 4096 32768 15628017664
ocean5x1 dm-14 4096 32768 15628017664
oceanSx0 dm-15 512 32768 150962176
oceanSx1 dm-16 512 32768 150962176
leaf      partition dev  lsblk “min-io”  lsblk “phy-sec”  lsblk “log-sec”  lsblk “start”  lsblk “size”
ocean0x0 sdf1 4096 4096 512 40 4000786989056
ocean1x0 sde1 4096 4096 512 40 4000786989056
ocean2x0 sdc1 4096 4096 512 40 4000786989056
ocean3x0 sdd1 4096 4096 512 40 4000786989056
ocean4x0 sdj1 4096 4096 512 2048 8001561821184
ocean5x0 sdi1 4096 4096 512 2048 8001561821184
ocean0x1 sdp1 4096 4096 512 40 4000786989056
ocean1x1 sdn1 4096 4096 512 40 4000786989056
ocean2x2 sdo1 4096 4096 512 40 4000786989056
ocean3x1 sdm1 4096 4096 512 40 4000786989056
ocean4x2 sdl1 4096 4096 512 2048 8001561821184
ocean5x1 sdk1 4096 4096 512 2048 8001561821184
oceanSx0 sda2 512 512 512 2099200 77309411328
oceanSx1 sdb1 512 512 512 2048 77309411328
leaf      disk dev  lsblk “min-io”  lsblk “phy-sec”  lsblk “log-sec”  lsblk “size”
ocean0x0 sdf 4096 4096 512 4000787030016
ocean1x0 sde 4096 4096 512 4000787030016
ocean2x0 sdc 4096 4096 512 4000787030016
ocean3x0 sdd 4096 4096 512 4000787030016
ocean4x0 sdj 4096 4096 512 8001563222016
ocean5x0 sdi 4096 4096 512 8001563222016
ocean0x1 sdp 4096 4096 512 4000787030016
ocean1x1 sdn 4096 4096 512 4000787030016
ocean2x2 sdo 4096 4096 512 4000787030016
ocean3x1 sdm 4096 4096 512 4000787030016
ocean4x2 sdl 4096 4096 512 8001563222016
ocean5x1 sdk 4096 4096 512 8001563222016
oceanSx0 sda 512 512 512 480103981056
oceanSx1 sdb 512 512 512 240057409536
leaf disk dev disk path driver controller
ocean0x0 sdf pci-0000:01:00.0-sas-phy3-lun-0 mpt3sas x16 slot
ocean1x0 sde pci-0000:01:00.0-sas-phy2-lun-0 mpt3sas x16 slot
ocean2x0 sdc pci-0000:01:00.0-sas-phy1-lun-0 mpt3sas x16 slot
ocean3x0 sdd pci-0000:01:00.0-sas-phy0-lun-0 mpt3sas x16 slot
ocean4x0 sdj pci-0000:01:00.0-sas-phy7-lun-0 mpt3sas x16 slot
ocean5x0 sdi pci-0000:01:00.0-sas-phy6-lun-0 mpt3sas x16 slot
ocean0x1 sdp pci-0000:06:00.0-sas-phy7-lun-0 mpt3sas x2 slot
ocean1x1 sdn pci-0000:06:00.0-sas-phy6-lun-0 mpt3sas x2 slot
ocean2x2 sdo pci-0000:06:00.0-sas-phy5-lun-0 mpt3sas x2 slot
ocean3x1 sdm pci-0000:06:00.0-sas-phy4-lun-0 mpt3sas x2 slot
ocean4x2 sdl pci-0000:06:00.0-sas-phy3-lun-0 mpt3sas x2 slot
ocean5x1 sdk pci-0000:06:00.0-sas-phy2-lun-0 mpt3sas x2 slot
oceanSx0 sda pci-0000:00:1f.2-ata-3.0 ahci onboard
oceanSx1 sdb pci-0000:00:1f.2-ata-2.0 ahci onboard

Thanks! Sorry this has gone on so long. I do know it sucks when you're the one person with the setup that tickles things just so.

No worries! I really appreciate the time you’ve spent investigating this.

delan avatar May 15 '24 07:05 delan

No errors when scrubbing in zfs 2.2.4, zfs_vdev_disk_classic=0, with losetup workaround:

$ journalctl -b -t zed -t kernel | rg ' kernel: zio | kernel: Linux version |\]: ZFS Event Daemon | class=(vdev|resilver|scrub)_'
May 15 01:44:12 venus kernel: Linux version 6.1.90 (nixbld@localhost) (gcc (GCC) 12.3.0, GNU ld (GNU Binutils) 2.40) #1-NixOS SMP PREEMPT_DYNAMIC Thu May  2 14:29:32 UTC 2024
May 15 01:44:14 venus zed[4231]: ZFS Event Daemon 2.2.4-1 (PID 4231)
May 15 01:44:14 venus zed[4335]: eid=13 class=resilver_start pool='ocean'
May 15 01:46:00 venus zed[7907]: eid=16 class=resilver_finish pool='ocean'
May 15 10:54:34 venus zed[67152]: eid=18 class=scrub_start pool='ocean'
May 16 03:00:35 venus zed[1724077]: eid=122 class=scrub_finish pool='ocean'

(If the loop does help, then that output without and with would be great too!).

zfs 2.2.4, zfs_vdev_disk_classic=0, without losetup workaround:

Details for each layer (zfs-15646-tables.sh)
leaf      loop dev  losetup “log-sec”  losetup “offset”  lsblk “min-io”  lsblk “phy-sec”  lsblk “log-sec”  lsblk “size”
(no rows: no loop devices, since the losetup workaround is not in use here)
leaf      luks dev  cryptsetup “sector size”  cryptsetup “offset”  cryptsetup “size”
ocean0x0 dm-3 4096 32768 7814004320
ocean1x0 dm-5 4096 32768 7814004320
ocean2x0 dm-7 4096 32768 7814004320
ocean3x0 dm-9 4096 32768 7814004320
ocean4x0 dm-11 4096 32768 15628017664
ocean5x0 dm-13 4096 32768 15628017664
ocean0x1 dm-4 4096 32768 7814004320
ocean1x1 dm-6 4096 32768 7814004320
ocean2x2 dm-8 4096 32768 7814004320
ocean3x1 dm-10 4096 32768 7814004320
ocean4x2 dm-12 4096 32768 15628017664
ocean5x1 dm-14 4096 32768 15628017664
oceanSx0 dm-15 512 32768 150962176
oceanSx1 dm-16 512 32768 150962176
leaf      partition dev  lsblk “min-io”  lsblk “phy-sec”  lsblk “log-sec”  lsblk “start”  lsblk “size”
ocean0x0 sde1 4096 4096 512 40 4000786989056
ocean1x0 sdf1 4096 4096 512 40 4000786989056
ocean2x0 sdc1 4096 4096 512 40 4000786989056
ocean3x0 sdd1 4096 4096 512 40 4000786989056
ocean4x0 sdj1 4096 4096 512 2048 8001561821184
ocean5x0 sdi1 4096 4096 512 2048 8001561821184
ocean0x1 sdp1 4096 4096 512 40 4000786989056
ocean1x1 sdn1 4096 4096 512 40 4000786989056
ocean2x2 sdo1 4096 4096 512 40 4000786989056
ocean3x1 sdm1 4096 4096 512 40 4000786989056
ocean4x2 sdl1 4096 4096 512 2048 8001561821184
ocean5x1 sdk1 4096 4096 512 2048 8001561821184
oceanSx0 sda2 512 512 512 2099200 77309411328
oceanSx1 sdb1 512 512 512 2048 77309411328
leaf      disk dev  lsblk “min-io”  lsblk “phy-sec”  lsblk “log-sec”  lsblk “size”
ocean0x0 sde 4096 4096 512 4000787030016
ocean1x0 sdf 4096 4096 512 4000787030016
ocean2x0 sdc 4096 4096 512 4000787030016
ocean3x0 sdd 4096 4096 512 4000787030016
ocean4x0 sdj 4096 4096 512 8001563222016
ocean5x0 sdi 4096 4096 512 8001563222016
ocean0x1 sdp 4096 4096 512 4000787030016
ocean1x1 sdn 4096 4096 512 4000787030016
ocean2x2 sdo 4096 4096 512 4000787030016
ocean3x1 sdm 4096 4096 512 4000787030016
ocean4x2 sdl 4096 4096 512 8001563222016
ocean5x1 sdk 4096 4096 512 8001563222016
oceanSx0 sda 512 512 512 480103981056
oceanSx1 sdb 512 512 512 240057409536
leaf disk dev disk path driver controller
ocean0x0 sde pci-0000:01:00.0-sas-phy3-lun-0 mpt3sas x16 slot
ocean1x0 sdf pci-0000:01:00.0-sas-phy2-lun-0 mpt3sas x16 slot
ocean2x0 sdc pci-0000:01:00.0-sas-phy1-lun-0 mpt3sas x16 slot
ocean3x0 sdd pci-0000:01:00.0-sas-phy0-lun-0 mpt3sas x16 slot
ocean4x0 sdj pci-0000:01:00.0-sas-phy7-lun-0 mpt3sas x16 slot
ocean5x0 sdi pci-0000:01:00.0-sas-phy6-lun-0 mpt3sas x16 slot
ocean0x1 sdp pci-0000:06:00.0-sas-phy7-lun-0 mpt3sas x2 slot
ocean1x1 sdn pci-0000:06:00.0-sas-phy6-lun-0 mpt3sas x2 slot
ocean2x2 sdo pci-0000:06:00.0-sas-phy5-lun-0 mpt3sas x2 slot
ocean3x1 sdm pci-0000:06:00.0-sas-phy4-lun-0 mpt3sas x2 slot
ocean4x2 sdl pci-0000:06:00.0-sas-phy3-lun-0 mpt3sas x2 slot
ocean5x1 sdk pci-0000:06:00.0-sas-phy2-lun-0 mpt3sas x2 slot
oceanSx0 sda pci-0000:00:1f.2-ata-3.0 ahci onboard
oceanSx1 sdb pci-0000:00:1f.2-ata-2.0 ahci onboard

delan avatar May 16 '24 04:05 delan

I can now reproduce the problem immediately, rather than having to wait >10 hours each time, by patching zfs to only scrub a dataset known to be especially vulnerable to this bug. Anything else I can do to help, let me know :)

[screenshot]

delan avatar May 17 '24 04:05 delan

Thanks for all the extra info.

I wasn't able to reproduce this, but I couldn't exactly reproduce the structure either, because I don't have any drives that report physical 4096 / logical 512, and I couldn't persuade qemu to invent one that also worked with the partition offsets.

I don't know if that is significant, but changing block sizes in the stack causing IOs at one level to not have the right shape at another level feels like it might be on the right track. So I suspect something around that. I did a bit of a code read through the layers but without having a good sense of what the input looks like, it's hard to guess which way the branches fall out.

The next step is likely to be spending some time with blktrace/blkparse, looking for weird split/remap/merge ops on those failing IOs as they move up and down the stack. Possibly we'll also need something to show the page structure within the IO, which I think means an OpenZFS patch, unless we get really good at bpftrace on hardcore mode. Probably there's other/better ways, but I don't know them.
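
As a rough starting point, that might look something like this (device name is just an example; ideally we'd capture both the dm device and the disk underneath it and compare):

# capture ~60 seconds of block-layer events on one failing dm device
$ sudo blktrace -d /dev/mapper/ocean0x0 -w 60 -o ocean0x0-dm
# then look for split (X), remap (A) and merge (M) actions around the failing offsets
$ blkparse -i ocean0x0-dm | less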

I'll work on getting a patch done and some commands to run as soon as I can. However, I'm hopping on a plane tomorrow and won't be very consistently available for the next couple of weeks. I'll try my best but it might be a little while before I have anything for you. (I mean, no warranty and all that, but still, there's a real problem here somewhere and you've been very patient and I don't wanna take that for granted :) ).

robn avatar May 21 '24 05:05 robn

Hi, any updates on this? I ended up having to detach and reattach a few of the disks to get the resilvering to finish after my last round of tests, but now I should be ready for further testing.

delan avatar Jun 18 '24 06:06 delan

Hi, sorry, I was away for a bit longer than expected, and am only just starting to get back onto this.

Right now I don't have anything else to give you. I'm part way through setting up a test to build semi-random DM stacks with different block sizes and then running various known-stressful workloads against them, trying to find a scenario that reproduces the problem that I can then study. I'm pretty swamped on dayjob and home stuff, but I'm trying to push it along a little each evening. I would hope to be running actual tests sometime on the weekend, but I'm not sure - lot going on with not-computers atm :grimacing:

I'll do my best to let you know something in the next week or so, hopefully more than "I didn't get to it yet". Sorry, and thanks again!

robn avatar Jun 20 '24 00:06 robn

So I do have some small news on this, but no solution yet. There's been a tiny number (<5) of reports, and all have (a) LUKS2 and (b) mixed 512/4096 block sizes somewhere in the stack. Unfortunately I've not been able to reproduce it at all.

So if you're still interested in testing, I have a debugging patch: robn/zfs@7a97f21. This should apply cleanly on top of 2.2.4, 2.2.5 and master.

Enable new BIO submission with zfs_vdev_disk_classic=0, and then this patch with zfs_vdev_disk_debug_bio_fill=1. Then, when the kernel rejects an IO and emits the zio pool=... error messages to the kernel log, it'll also log the structure of what it submitted to the kernel. Hopefully, this will help me understand what's happening without having to dig into the kernel itself.

The output might look something like this:

[   17.222249] zio pool=tank vdev=/dev/mapper/errblk error=5 type=2 offset=42078208 size=16384 flags=1074267264
[   17.222396] vbio debug: vd=3 ms=80 mb=140000 bs=1000 ty=2 sz=4000 bt=G 1[^1]1[1]1[1]1[1]$

Looks like line noise, but means a lot to me.

When zfs_vdev_disk_debug_bio_fill=0, there should be basically no overhead. When it's enabled, there's a tiny bit of extra overhead for every IO (~256 bytes of memory and a tiny bit of code to put things in it) that gets thrown away if there isn't an error, but it shouldn't really be noticeable.

Incidentally, zfs_vdev_disk_debug_bio_fill can be changed at runtime through /sys/module/zfs/parameters/zfs_vdev_disk_debug_bio_fill. Also if you want to test it, you can set this to 2, and then it'll show the IO structure for successful IOs too. This will obliterate your kernel log and trash your performance, but you'll know if it's on :grimacing:.
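
A quick sketch of toggling it at runtime:

# log the submitted BIO structure when an IO fails
$ echo 1 | sudo tee /sys/module/zfs/parameters/zfs_vdev_disk_debug_bio_fill
# very noisy: also log the structure of successful IOs
$ echo 2 | sudo tee /sys/module/zfs/parameters/zfs_vdev_disk_debug_bio_fill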

Usual caveats about experimental code, use at your own risk, etc. It should be fine but I don't wanna be silly about it.

If you get a hit, let me know! And hopefully it'll show something extremely obviously stupid on my part.

(If it ends up helping, I may try to get it included upstream. Live debugging tools are so incredibly useful).

Thanks!

robn avatar Aug 12 '24 06:08 robn

Probably also affected by ZFS on top of LUKS2 ...

Difficult to know if it's the same thing, but I have already been bitten on one system (Crucial MX500 500GB, FW 022 or 023) a few Months ago or so. Both Drives affected. Only Solution was to reformat everything (tried to change PSU, Backplane, Controller, MB, entire System, etc). Then it worked normally as a boot Drive in another System (< 10% space Used).

And 2 Days ago I have been bitten again, this Time by a pair of Crucial MX500 1000GB FW 046 (upgraded from FW 042/043 just in order to PREVENT further Issues).

The Crucial MX500 Reports mixed block sizes (512B logical / 4096B physical); is this what you refer to, @robn?

GNU Parted 3.6
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) u MiB                                                            
(parted) p                                                                
Model: ATA CT1000MX500SSD1 (scsi)
Disk /dev/sda: 953870MiB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags: 

Number  Start    End        Size       File system  Name        Flags
 1      1,00MiB  32,0MiB    31,0MiB                 DISK1_BIOS  bios_grub
 2      32,0MiB  544MiB     512MiB     fat32        DISK1_EFI   boot, esp
 3      544MiB   2592MiB    2048MiB    ext4         DISK1_BOOT
 4      2592MiB  953614MiB  951022MiB               DISK1_ROOT

Initially only one drive started throwing out Errors. I switched from the onboard SATA AHCI Controller from Supermicro X11SSL-F to a LSI 9217-4i4e Cross-Flashed into IT Mode (P20 Firmware + BIOS Boot ROM).

After the Switch, a boot into LiveUSB worked just fine. The Pool imported normally and just did a quick Resilver on the (initially single) Drive that was affected. Scrub was performed and no Issues were reported.

Tried to do apt dist-upgrade to see if a newer Kernel would fix the Issue (errors would start being fired at the Console as soon as Proxmox VE Finished Booting up). At 40% it started hanging, then I checked dmesg and now BOTH Drives are throwing loads of IO Errors.

Actually the SAME 2 blocks/sectors seem to be affected (total of 4 Combinations for flags and offset):

[12525.816835] zio pool=rpool vdev=/dev/mapper/ata-CT1000MX500SSD1_2205E605725B_crypt error=5 type=2 offset=623386214400 size=32768 flags=1572992
[12525.816837] zio pool=rpool vdev=/dev/mapper/ata-CT1000MX500SSD1_2205E605725B_crypt error=5 type=2 offset=623385092096 size=32768 flags=1572992
[12525.817316] zio pool=rpool vdev=/dev/mapper/ata-CT1000MX500SSD1_2205E6057147_crypt error=5 type=2 offset=623385092096 size=32768 flags=1605761
[12525.817331] zio pool=rpool vdev=/dev/mapper/ata-CT1000MX500SSD1_2205E6057147_crypt error=5 type=2 offset=623386214400 size=32768 flags=1572992
[12525.817410] zio pool=rpool vdev=/dev/mapper/ata-CT1000MX500SSD1_2205E605725B_crypt error=5 type=2 offset=623385092096 size=32768 flags=1605761
[12525.817713] zio pool=rpool vdev=/dev/mapper/ata-CT1000MX500SSD1_2205E6057147_crypt error=5 type=2 offset=623386214400 size=32768 flags=1605761

Meanwhile iostat showed a LOT of write IOPS (or rather ATTEMPTED write IOPS, ~ 15'000 w/s) on the Drives as well as some read IOPS. Lots of MB/s (cannot remember if 300 MB/s write + 150 MB/s read or 150 MB/s write + 300 MB/s read) in both Directions actually. I did NOT save that unfortunately :disappointed:.

EDIT 1: My previous Experience as well as this one might suggest that the Issue typically shows up after some Time (previously maybe 2-3 Years, this one around 1 Year), with MOSTLY Full Pools (90% or so), around 100 TBW or so (this Week's SSDs are both at 94 TBW).

EDIT 2: LUKS Dump shows Sector Size of 4k (excerpt from another System, but it's the same Crucial SSD).

Data segments:
  0: crypt
	offset: 16777216 [bytes]
	length: (whole device)
	cipher: aes-xts-plain64
	sector: 4096 [bytes]

EDIT 3: zpool get ashift rpool (excerpt from another System, but it's the same Crucial SSD).

NAME   PROPERTY  VALUE   SOURCE
rpool  ashift    12      local

luckylinux avatar Aug 13 '24 17:08 luckylinux

@luckylinux thanks for taking the time to post. I need to know the kernel version, ZFS version, and if you're changing zfs_vdev_disk_classic at all (2.2.4+).

robn avatar Aug 13 '24 20:08 robn

I know this issue is being discussed in some other places, so for any new people turning up: Hello!

This is a complex issue that we don't fully understand yet. It's certainly not as simple as "LUKS on 512/4096 devices". The tiny number of reports I've seen have involved LUKS2, but also more complex block device constructions involving mixed block sizes. It may also involve pools nearing capacity or highly fragmented, and may involve large aggregate write sizes. There's a lot of variables, I don't know them all, and I have been trying for months to reproduce it in the lab without success.

It's also unlikely to be new. This problem is likely to have existed in some form since at least OpenZFS 2.0.0, and possibly back to the start of the ZFS-on-Linux project.

All this is to say, if you've been wondering if you're affected by this but you haven't seen any IO errors or you're not in an unusual situation, then you probably have nothing to worry about. No guarantees, as always, but this is a pretty niche case.

If you do think you're hitting this, then I definitely need your help! I need kernel version, OS version, OpenZFS version, a description of your drives, block device and pool topology, capacity, fragmentation, and workload, and whether or not changing zfs_vdev_disk_classic makes a difference. See @delan's outstanding work nearer the top of this issue; this is the level of detail this needs.
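
A quick way to collect most of that (the commands are only suggestions; trim or extend as you see fit):

$ uname -r                        # kernel version
$ cat /etc/os-release             # OS version
$ zfs version                     # OpenZFS userland and kmod versions
$ zpool list -v                   # pool topology, capacity and fragmentation
$ lsblk -t                        # block device stack with min-io/phy-sec/log-sec
$ cat /sys/module/zfs/parameters/zfs_vdev_disk_classic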

If you can reliably reproduce it with zfs_vdev_disk_classic=0, and you have the ability and inclination to run a patch, then you might try this patch, which will give more information about how IO is being handed off to the kernel.

Thanks for your interest!

robn avatar Aug 13 '24 21:08 robn

So if you're still interested in testing, I have a debugging patch: robn/zfs@7a97f21. This should apply cleanly on top of 2.2.4, 2.2.5 and master.

Enable new BIO submission with zfs_vdev_disk_classic=0, and then this patch with zfs_vdev_disk_debug_bio_fill=1. Then, when the kernel rejects an IO and emits the zio pool=... error messages to the kernel log, it'll also log the structure of what it submitted to the kernel. Hopefully, this will help me understand what's happening without having to dig into the kernel itself.

dmesg with read and write errors under:

delan avatar Aug 14 '24 02:08 delan

@robn I'll try to provide more Info here and answer your Questions.

In my case it's more difficult to provide Detailed logs from the "Real" System (Proxmox VE) because this is my Root Pool. Thus, when the BUG/Issue hits, it basically freezes the entire System. I cannot SSH into it anymore. And of course the Logs will NOT be written to file.

I don't have netconsole setup right now. Heck not even automatic Backups (too many Issues popping up one after the other, so I have to firefight instead of doing what really needs to be done).

Common:

  • Motherboard: Supermicro X11SSL-F with BIOS 3.3
  • CPU: Intel Xeon E3 1240 v5
  • RAM: 64GB DDR4 Unbuffered ECC RAM 2666MHz running at 2133 MHz
  • Onboard SATA Controller Used (White SATA Ports)
  • Alternative test using LSI 9217-4i4e Cross-flashed to IT Mode, P20 Firmware + BIOS Boot ROM (which is the Current Configuration)
  • Disks: Crucial MX500 1000GB from 2022, FW 046 (upgraded from 042 or 043 in order to Prevent some known BUGs)

Real System

  • Arch: AMD64
  • OS: Proxmox VE on top of Debian GNU/Linux 12 Bookworm
  • Kernel: probably proxmox-kernel-6.8.8-3-pve-signed (if I did NOT reboot after the last System Update) or proxmox-kernel-6.8.12-1-pve-signed although it's NOT marked as installed even though /boot/vmlinuz-6.8.12-1-pve exists (if I DID reboot after the last System Update)
  • ZFS: 2.2.4 from Proxmox Repository (pve-no-subscription)
zfs-2.2.4-pve1
zfs-kmod-2.2.4-pve1

LiveUSB System:

  • Arch: AMD64
  • OS: Debian GNU/Linux 12 Bookworm AMD with Kernel+ZFS from Backports Repository
  • Kernel: linux-image-6.9.7+bpo-amd64 or linux-image-6.7.12+bpo-amd64, I will try to always boot with linux-image-6.7.12+bpo-amd64 in order to keep a "compliant" Kernel "officially" supported by ZFS (<= Kernel 6.8.x)
  • ZFS: 2.2.4 from Debian Backports
zfs-2.2.4-1~bpo12+1
zfs-kmod-2.2.4-1~bpo12+1

Trying to import after the last write IO barrage / Issue / BUG didn't go smoothly, but I managed to get it imported on LiveUSB like this (had to force it with -F):

root@LiveTools1:/tools_local/linux-setup-pve15# zpool import -f rpool -o readonly=off -R /mnt/debian
cannot import 'rpool': one or more devices is currently unavailable
root@LiveTools1:/tools_local/linux-setup-pve15# zpool status -v
no pools available
root@LiveTools1:/tools_local/linux-setup-pve15# zpool import -F -f rpool -o readonly=off -R /mnt/debian
root@LiveTools1:/tools_local/linux-setup-pve15# zpool status -v
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 01:10:18 with 0 errors on Tue Aug 13 13:24:11 2024
config:

	NAME                                        STATE     READ WRITE CKSUM
	rpool                                       ONLINE       0     0     0
	  mirror-0                                  ONLINE       0     0     0
	    ata-CT1000MX500SSD1_2205E6057147_crypt  ONLINE       0     0     0
	    ata-CT1000MX500SSD1_2205E605725B_crypt  ONLINE       0     0     0

errors: No known data errors

(I did NOT redo a Scrub again ...)

root@LiveTools1:/tools_local/linux-setup-pve15# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool   920G   857G  63.0G        -         -    89%    93%  1.00x    ONLINE  /mnt/debian

dmesg.txt

I'm NOT using any setting YET for zfs_vdev_disk_classic (I will try soon based on PMT's Suggestion over IRC).

EDIT 1: on 2024-08-11, Immediately after the System Failure happened (during the night) and while trying to assess using the LiveUSB, I also saved some zpool events -v -f rpool Logs: zfs_events_20240811.txt. But this is AFTER a Reboot from the Real System into the LiveUSB System, so the Checksum Errors might just indicate some Analysis of the Damage done by the BUG.

EDIT 2: based on the Conversations on IRC it seems it's NOT suggested to change the defaults and set options zfs zfs_vdev_disk_classic=0 in e.g. /etc/modprobe.d/zfs-luks.conf. So I will NOT do that.

luckylinux avatar Aug 14 '24 04:08 luckylinux

@delan this is fantastic, thank you! It showed me some things that I am quite surprised by, and which suggest I didn't understand some of this stuff as completely as I thought I did.

Here are some new patches. The same patch doesn't apply cleanly to different versions, so I've made some variants:

  • master: robn/zfs@228b2bbdc
  • 2.2.5: robn/zfs@bb4384363
  • 2.2.4: robn/zfs@755f6e817

This patch will produce slightly different output with zfs_vdev_disk_debug_bio_fill=1, looking more like these:

[   19.271965] vbio debug: ms=80 mb=140000 bs=200 ty=2 sz=200 bt=L 1[C^1]$
[   19.273256] vbio debug: ms=80 mb=140000 bs=200 ty=2 sz=aa00 bt=L 5[^5]8[8]8[8]8[8]8[8]8[8]8[8]8[8]8[8]8[8]8[8]$
[   22.838168] vbio debug: ms=80 mb=140000 bs=200 ty=2 sz=ac00 bt=L /56/[^/56/]$

Same test would be great.

If it continues failing, I'd love it if you could test again, adding zfs_abd_page_iter_disable_compound=1 to the options (keep zfs_vdev_disk_debug_bio_fill=1). This is to rule in/out an optimisation codepath as the source of the problem. If it still throws errors, then it's not that, or at least, not alone. If it doesn't throw errors, it might be an indicator that that optimisation is buggy (it might also be that we're just unfortunate enough not to trip it). But at least in that case I'll be able to ship a patch to disable it pending further investigation.

Thank you!

robn avatar Aug 14 '24 10:08 robn

Quick incident report… so yesterday morning I tested the first bio debug patch:

  • 09:33:45, booted zfs 2.2.4, zfs_vdev_disk_debug_bio_fill.patch (robn/zfs@7a97f21), ds3.patch (limits scrub to one dataset), zfs_vdev_disk_classic=0, zfs_vdev_disk_debug_bio_fill=0→1, without losetup workaround
  • 10:04:22, zpool checkpoint ocean
  • 10:06:54, zpool scrub ocean
  • 10:17:15, last zed entry for scrub, after many errors
  • 21:00:01, start of daily backups, receiving snapshots into ocean
  • 21:00:32, snapshot receives cause several errors
  • journalctl -b -3 for the boot above

But when I rebooted to test the second patch, the pool got into a weird state where the resilver kept flapping, preventing me from scrubbing or exporting. I perhaps made the mistake of reimporting the pool without the losetup workaround, which may have prevented the pool from resilvering properly:

  • 21:15:33, booted zfs 2.2.4, zfs-15646-patch2-for-2.2.4.patch (robn/zfs@755f6e817), ds3.patch (limits scrub to one dataset), zfs_vdev_disk_classic=0, zfs_vdev_disk_debug_bio_fill=0, without losetup workaround
  • 21:15:46, import immediately hits a bunch of checksum errors, kicking off a resilver
  • 21:16:08, resilver inexplicably restarts
  • 21:16:21, resilver inexplicably restarts
  • 21:16:34, resilver inexplicably restarts
  • [resilver continues to restart over and over]
  • journalctl -b -2 for the boot above

Suspecting this was because I had been limiting scans to one dataset, I then tweaked my patch for that to take a module param and scan the pool normally by default. This actually made things worse:

  • 21:49:29, booted zfs 2.2.4, zfs-15646-patch2-for-2.2.4.patch (robn/zfs@755f6e817), ds4.patch (limits scrub to one dataset, but only if zfs_scan_only_that_sus_dataset=1), zfs_vdev_disk_classic=0, zfs_vdev_disk_debug_bio_fill=0, without losetup workaround
  • 21:49:43, import kicks off a resilver, yielding many errors at 21:49:59
  • 21:51:34, resilver restarts, immediately yielding many errors
  • 21:53:03, resilver restarts, immediately yielding many errors
  • 21:54:27, resilver restarts, immediately yielding many errors
  • [resilver continues to restart over and over]
  • journalctl -b -1 for the boot above

At this point, I had several disks marked as degraded, including both sides of one of the mirror vdevs. Once I could get the pool to export, I was able to break the cycle by rewinding to checkpoint:

  • 22:18:32, booted zfs 2.2.4, zfs-15646-patch2-for-2.2.4.patch (robn/zfs@755f6e817), ds4.patch (limits scrub to one dataset, but only if zfs_scan_only_that_sus_dataset=1), zfs_vdev_disk_classic=0, zfs_vdev_disk_debug_bio_fill=0, no import on boot
  • 23:49:27, import with losetup workaround, immediately yielding several checksum errors
  • 23:51:40, kicks off a resilver
  • 23:52:31, resilver inexplicably restarts
  • 23:53:23, resilver inexplicably restarts
  • 23:54:15, resilver inexplicably restarts
  • [resilver continues to restart over and over]
  • 00:06:51, zpool clear ocean; resilver continues to restart
  • 00:10:48, zpool export ocean; resilver continues to restart
  • 00:15:26, set zfs_scan_suspend_progress=0→1; pool finally exports
  • 00:21:52, zpool import --rewind-to-checkpoint ocean; all is well
  • journalctl -b -0 for the boot above

I’ll scrub the whole pool, then I’ll do my next round of testing. Lessons for me:

  • Reimport with losetup workaround between tests.
  • Set zfs_vdev_disk_debug_bio_fill=1 on boot, rather than =0→1 at runtime, otherwise you might lose valuable debug output. That said, I probably wouldn’t set =2 on boot, only =1→2 at runtime.

delan avatar Aug 15 '24 03:08 delan

journalctl -b with read and write errors under:

  • zfs 2.2.4
  • plus zfs-15646-patch2-for-2.2.4.patch (https://github.com/robn/zfs/commit/755f6e817)
  • plus ds4.patch (limits scrub to one dataset, but only if zfs_scan_only_that_sus_dataset=1)
  • zfs_vdev_disk_classic=0
  • zfs_vdev_disk_debug_bio_fill=1
  • zfs_abd_page_iter_disable_compound=0
  • zfs_scan_only_that_sus_dataset=0
  • no import on boot

Same test would be great. If it continues failing, I'd love it if you could test again, adding zfs_abd_page_iter_disable_compound=1 to the options (keep zfs_vdev_disk_debug_bio_fill=1).

Testing scrub with zfs_abd_page_iter_disable_compound=0:

  • 14:11:15, import without losetup workaround
  • 14:12:49, set zfs_scan_only_that_sus_dataset=0→1
  • 14:13:13, zpool scrub ocean
  • 14:23:25, last zed entry for scrub, after many errors

Now to repair the pool for more testing. It looks like even reimporting the pool with the losetup workaround gets us into that weird state where the resilver keeps flapping:

  • 14:28:49, set zfs_scan_only_that_sus_dataset=1→0
  • 14:31:34, zpool export ocean
  • 14:32:37, import with losetup workaround
  • 14:32:50, kicks off a resilver
  • 14:33:12, resilver inexplicably restarts
  • 14:33:21, resilver inexplicably restarts
  • 14:33:29, resilver inexplicably restarts
  • [resilver continues to restart over and over]
  • 14:34:05, set zfs_scan_suspend_progress=0→1
  • 14:34:18, zpool export ocean
  • 14:35:49, zpool import --rewind-to-checkpoint ocean; all is well
  • 14:36:53, zpool checkpoint ocean

Testing scrub with zfs_abd_page_iter_disable_compound=1:

  • 14:37:36, set zfs_abd_page_iter_disable_compound=0→1
  • 14:37:40, set zfs_scan_suspend_progress=1→0
  • 14:38:00, set zfs_scan_only_that_sus_dataset=0→1
  • 14:38:41, zpool export ocean
  • 14:39:10, import without losetup workaround
  • 14:39:40, zpool scrub ocean
  • 14:49:53, last zed entry for scrub, after many errors

delan avatar Aug 17 '24 06:08 delan

@delan it's doing my head in, but I think I'm on the right track. This log output has been amazingly useful!

I need a tiny bit more info from you, about how your dm-crypt maps are set up. This is the kernel-side crypto stuff that LUKS drives.

The output from dmsetup table would be sufficient. For a dm-crypt device, it looks something like:

# dmsetup table
loop0c: 0 172032 crypt aes-xts-plain64 :64:logon:cryptsetup:d5de04ec-ca52-4172-832b-bff7686c1e54-d0 0 7:0 32768 1 sector_size:4096

For avoidance of doubt, there's no sensitive material here. The "key" field (:64:logon:...) is the name of the key in the kernel keyring, which cryptsetup manages. Details are in the dm-crypt docs. If you did want to limit it, I want everything after the key (so 0 7:0 32768 ...).

FWIW, what I'm looking at is that some of the inputs for the cipher parameters are derived from sector numbers, which themselves are derived from data offsets. Most of the kernel block level is pretty forgiving if these aren't specified quite right (as in, it operates on the data you load into the IO objects more than the metadata you send with it). dm-crypt (rightly) relies on it being accurate though, and I think there's something in the 4K->512 mapping that is screwing up, and it's probably because we're setting something wrong.

Now I've got a better idea of where to look, I'll have another go at reproducing it too.

robn avatar Aug 20 '24 11:08 robn