Today's update to ca0141f325ec706d38a06f9aeb8e5eb6c6a8d09a (almost identical to current 2.3.0 RC) caused permanent pool corruption
One of my machines (FC40) recently received a kernel update
- from kernel-6.8.11-300.fc40.x86_64 to kernel-6.10.12-200.fc40.x86_64, and
- the ZFS+DKMS+Dracut master lineage, from commit 02c5aa9b092818785ed8db4e2246a828278138e3 to ca0141f325ec706d38a06f9aeb8e5eb6c6a8d09a,
simultaneously. This took place earlier today. The pool was healthy, in use, and had recently been scrubbed multiple times. No errors anywhere, neither in the kernel log nor in the journal.
Mere minutes after I rebooted to the new kernel and ZFS, my Prometheus setup alerted me to 30 checksum errors, several write errors, and 4 data errors. Upon inspection:
[root@penny ~]# zpool status -v
pool: chest
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
...abbreviated...
NAME STATE READ WRITE CKSUM
...abbreviated...
errors: Permanent errors have been detected in the following files:
<metadata>:<0x16>
<metadata>:<0x3c>
<metadata>:<0x44>
<metadata>:<0x594>
The kernel ring buffer likewise showed no hardware errors.
I rebooted to the older kernel and ZFS module, and started a scrub. It's still ongoing, but so far it has found no problems and produced no WRITE or CKSUM errors.
Interestingly, neither the cache nor the ZIL devices had that error. The boot drive also seemed unaffected.
This, to me, indicates a software issue, probably related to the LUKS write path (we've had these before) or mirroring (only the pool with mirrored drives was hit; the single boot drive, otherwise set up the same, was not).
The affected LUKS2 devices are all whole-disk formatted with a 4K sector size, and my pool is ashift 12. (The unaffected root pool, as well as the cache and ZIL devices of the affected pool, are not formatted with a 4K LUKS sector size.)
In the interest of seeing if it makes a difference, the affected LUKS devices are tuned with the following persistent flags:
Flags: allow-discards same-cpu-crypt submit-from-crypt-cpus no-read-workqueue no-write-workqueue
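For reference, flags like these can be applied persistently with cryptsetup's --allow-discards / --perf-* options. A minimal sketch, assuming a placeholder mapping name luks-chest-a and device /dev/sda (not the actual devices from this pool):
# Sketch: re-activate an open LUKS2 mapping with the performance flags
# above and store them in the LUKS2 header so they persist across boots.
sudo cryptsetup refresh luks-chest-a \
  --allow-discards \
  --perf-same_cpu_crypt \
  --perf-submit_from_crypt_cpus \
  --perf-no_read_workqueue \
  --perf-no_write_workqueue \
  --persistent
# Verify the persistent flags recorded in the header:
sudo cryptsetup luksDump /dev/sda | grep -A1 Flags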
The pool has the following properties and features:
NAME PROPERTY VALUE SOURCE
chest size 10.9T -
chest capacity 67% -
chest altroot - default
chest health ONLINE -
chest guid 2537396116593781450 -
chest version - default
chest bootfs - default
chest delegation on default
chest autoreplace off default
chest cachefile - default
chest failmode wait default
chest listsnapshots off default
chest autoexpand off default
chest dedupratio 1.00x -
chest free 3.55T -
chest allocated 7.34T -
chest readonly off -
chest ashift 12 local
chest comment - default
chest expandsize - -
chest freeing 0 -
chest fragmentation 9% -
chest leaked 0 -
chest multihost off default
chest checkpoint - -
chest load_guid 16604087848420727134 -
chest autotrim off default
chest compatibility off default
chest bcloneused 0 -
chest bclonesaved 0 -
chest bcloneratio 1.00x -
chest feature@async_destroy enabled local
chest feature@empty_bpobj active local
chest feature@lz4_compress active local
chest feature@multi_vdev_crash_dump enabled local
chest feature@spacemap_histogram active local
chest feature@enabled_txg active local
chest feature@hole_birth active local
chest feature@extensible_dataset active local
chest feature@embedded_data active local
chest feature@bookmarks enabled local
chest feature@filesystem_limits enabled local
chest feature@large_blocks enabled local
chest feature@large_dnode enabled local
chest feature@sha512 enabled local
chest feature@skein enabled local
chest feature@edonr enabled local
chest feature@userobj_accounting active local
chest feature@encryption enabled local
chest feature@project_quota active local
chest feature@device_removal enabled local
chest feature@obsolete_counts enabled local
chest feature@zpool_checkpoint enabled local
chest feature@spacemap_v2 active local
chest feature@allocation_classes enabled local
chest feature@resilver_defer enabled local
chest feature@bookmark_v2 enabled local
chest feature@redaction_bookmarks enabled local
chest feature@redacted_datasets enabled local
chest feature@bookmark_written enabled local
chest feature@log_spacemap active local
chest feature@livelist enabled local
chest feature@device_rebuild enabled local
chest feature@zstd_compress enabled local
chest feature@draid enabled local
chest feature@zilsaxattr disabled local
chest feature@head_errlog disabled local
chest feature@blake3 disabled local
chest feature@block_cloning disabled local
chest feature@vdev_zaps_v2 disabled local
chest feature@redaction_list_spill disabled local
chest feature@raidz_expansion disabled local
This is where my pool currently sits:
[root@penny ~]# zpool status
pool: chest
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub in progress since Wed Oct 9 22:37:54 2024
900G / 7.35T scanned at 950M/s, 136G / 7.35T issued at 143M/s
0B repaired, 1.80% done, 14:39:54 to go
config:
NAME STATE READ WRITE CKSUM
chest ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
dm-uuid-CRYPT-LUKS2-sda ONLINE 0 0 0
dm-uuid-CRYPT-LUKS2-sdb ONLINE 0 0 0
mirror-3 ONLINE 0 0 0
dm-uuid-CRYPT-LUKS2-sdc ONLINE 0 0 0
dm-uuid-CRYPT-LUKS2-sdd ONLINE 0 0 0
logs
dm-uuid-CRYPT-LUKS2-sde ONLINE 0 0 0
cache
dm-uuid-CRYPT-LUKS2-sdf ONLINE 0 0 0
errors: 4 data errors, use '-v' for a list
Update: good news! After reverting to the prior kernel + commit mentioned above, I am very happy to report that the scrub found no errors, and the data errors listed previously simply disappeared. So not a single bit of data loss!
The less good news: this strongly indicates that, under this configuration, there is a software defect in OpenZFS.
I see you listed pool status for pool chest but pool props for pool penny. Just making sure you've got the right copypasta here.
Next time it happens, it'd be good to see any errors in the kernel log (that is, the zio ... lines) and any related entries from drivers around the same time, as well as any relevant entries from zpool events -v (these are lost on reboot, so grab those first).
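For example, something along these lines before rebooting (a sketch; adjust paths as needed):
# Save the in-memory ZFS event log (lost on reboot) and any zio error
# lines from the kernel log for later analysis.
zpool events -v > /root/zpool-events-$(hostname)-$(date +%F).txt
journalctl -k -b | grep 'zio pool=' > /root/zio-errors-$(date +%F).txt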
Good catch, let me fix the comment above. In a minute.
Updated.
Any chance you have a test system available which could be used to help us bisect where this was introduced? I'd suggest starting by testing with https://github.com/openzfs/zfs/commit/ca0141f325ec706d38a06f9aeb8e5eb6c6a8d09a and the kernel-6.8.11-300.fc40.x86_64 kernel.
Hm! I have a laptop with ZFS atop LUKS and I think it has mirroring set up. It was one of the systems I was ready to upgrade, although that kernel might not boot on Xen (which is what the laptop boots first). Let me get back to you soonish with more info.
Tried rc1 with the proposed rc2 patchset on root-on-ZFS-on-LUKS with kernel 6.10. I got 800 write errors, with lines like this in the log:
zio pool=kc3000 vdev=/dev/mapper/root error=5 type=2 offset=25950006681088 size=4096 flags=317825
The SSD presents 4K sectors and LUKS2 also uses 4K sectors. Looks very much like #14533, with discussion ongoing in #15646, except that it's now failing on metadata writes too.
edit: after rebooting with zfs-2.2.6 and running a scrub all errors disappeared, so I guess there was at least one working copy of all metadata.
edit2: after another restart even 2.2.6 generates write errors, the scrub ends in 2 seconds, and there's a permanent error in kc3000/ROOT/gentoo:<0xfffffffffffffffe>. Seems like something actually broke permanently.
I was able to reproduce the write errors pretty easily on Arch Linux with kernels 6.6.56 (LTS) and 6.11.3 with zfs 2.3.0 rc1 (commit 3a9fca901b) even on a single, non-mirrored LUKS device. I wasn't able to reproduce it without LUKS so far. I used default settings for everything (cryptsetup, zpool) with a freshly created pool; the LUKS device automatically decided to use 4096 byte blocks. I then rsynced my home directory to the pool and observed "zpool status" where write errors started to appear after a few seconds.
So, I tried to find a commit which worked at some point on the master branch, but I wasn't able to find a good commit so far. The 2.2 branch does work for me, but it has diverged a lot from master in both directions (ahead/behind).
I have a question: Did the fixes that were performed on 2.2 for issues like #15533 or #15646 find their way back into master and thus into 2.3? If so, I would be interested in a pointer to which commit on master to start from when trying to find a later commit that breaks, relative to that point in time.
> Did the fixes that were performed on 2.2 for issues like https://github.com/openzfs/zfs/issues/15533 or https://github.com/openzfs/zfs/issues/15646 find their way back into master and thus into 2.3?
Yes, all of the work for these issues was originally done in master and then backported to the 2.2 releases. So both 2.2 and 2.3 include these changes. Where things do differ is that 2.2 releases default to using the classic disk IO code, while 2.3 defaults to the new method implemented in #15588. Would you mind testing with both zfs_vdev_disk_classic=1 (2.2 default) and zfs_vdev_disk_classic=0 (2.3 default) module options? That may help us narrow it down.
zfs_vdev_disk_classic=0|1 (uint)
If set to 1, OpenZFS will submit IO to Linux using the method it used
in 2.2 and earlier. This "classic" method has known issues with highly
fragmented IO requests and is slower on many workloads, but it has been
in use for many years and is known to be very stable. If you set this
parameter, please also open a bug report why you did so, including the
workload involved and any error messages.
This parameter and the classic submission method will be removed once
we have total confidence in the new method.
This parameter only applies on Linux, and can only be set at module
load time.
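Since it's load-time only, flipping it looks roughly like this (a sketch using the usual modprobe.d/dracut conventions; rebuild the initramfs if the zfs module is loaded from it):
# Test with the classic 2.2-style submission path:
echo 'options zfs zfs_vdev_disk_classic=1' | sudo tee /etc/modprobe.d/zfs-classic.conf
sudo dracut -f        # Fedora/dracut: regenerate the initramfs so the option applies at early boot
sudo reboot
# After reboot, confirm which path is active:
cat /sys/module/zfs/parameters/zfs_vdev_disk_classic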
Thank you for your reply. I tested some combinations with zfs 2.3.0 rc1, and zfs_vdev_disk_classic does indeed make a difference in the number of errors. Also, I was only able to get errors when using LUKS with 4k sectors, not with 512-byte sectors:
| kernel | classic | sector size | errors |
|---|---|---|---|
| 6.6.56 | 0 | 4k | low |
| 6.6.56 | 1 | 4k | high |
| 6.6.56 | 0 | 512 | none |
| 6.6.56 | 1 | 512 | none |
| 6.1.122 | 0 | 4k | low |
where "low" errors means single-digit write errors in about 30 seconds, "high" means thousands of errors.
Yes, it's only with 4K LUKS formatted devices. I could not repro with 512 byte LUKS formatted devices. How curious.
With newer LUKS tooling and OS installers defaulting to making LUKS volumes LUKS2, and making LUKS sector sizes the same as advanced format disks (minimum 4K), this is going to begin biting real hard on future ZFS users (as well as current ZFS users who buy new drives).
I re-tested with zfs 2.2.6 quickly just to make sure, and I'm not able to reproduce the issue there with 4K sectors (on a freshly created pool, as with all tests I did), neither with vdev_disk_classic 1 nor with 0.
I tested 2.3.0-rc2 with kernel 6.11.3 today on EndeavourOS and got data corruption too. All devices in my pools are LUKS2 devices (sector: 4096 [bytes]). After a few rsync and zfs send/receive operations I got this message from zed:
ZFS has finished a resilver:
eid: 406
class: resilver_finish
host: rakete
time: 2024-10-15 14:13:52+0200
pool: zstore
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub repaired 16K in 00:00:11 with 0 errors on Tue Oct 15 14:13:52 2024
config:
NAME STATE READ WRITE CKSUM
zstore ONLINE 0 0 0
mirror-0 ONLINE 0 8 0
WD40EFRX-68N32N0-WCC7K4JKN36U ONLINE 0 9 1
WD40EFRX-68N32N0-WCC7K6YX6SAA ONLINE 0 8 0
mirror-1 ONLINE 0 4 0
WD40EFRX-68N32N0-WCC7K2YAJA69 ONLINE 0 8 3
WD40EFRX-68N32N0-WCC7K6XALFKX ONLINE 0 4 0
cache
ata-Verbatim_Vi550_S3_493504108895630 ONLINE 0 0 0
errors: 1 data errors, use '-v' for a list
And in the journal I see plenty of this:
Okt 15 13:38:15 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K2YAJA69 error=5 type=2 offset=3449457299456 size=4096 flags=3145856
Okt 15 13:38:15 rakete zed[19356]: eid=44 class=io pool='zstore' vdev=WD40EFRX-68N32N0-WCC7K2YAJA69 size=4096 offset=3449457299456 priority=3 err=5 flags=0x300080 bookmark=36123:267484:0:0
Okt 15 13:38:21 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K6YX6SAA error=5 type=2 offset=3185924694016 size=4096 flags=3145856
Okt 15 13:38:21 rakete zed[19377]: eid=45 class=io pool='zstore' vdev=WD40EFRX-68N32N0-WCC7K6YX6SAA size=4096 offset=3185924694016 priority=3 err=5 flags=0x300080 bookmark=36123:284249:0:0
Okt 15 13:38:21 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K4JKN36U error=5 type=2 offset=3185924694016 size=4096 flags=3145856
Okt 15 13:38:21 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K6YX6SAA error=5 type=2 offset=3185924694016 size=4096 flags=3178625
Okt 15 13:38:21 rakete zed[19382]: eid=46 class=io pool='zstore' vdev=WD40EFRX-68N32N0-WCC7K4JKN36U size=4096 offset=3185924694016 priority=3 err=5 flags=0x300080 bookmark=36123:284249:0:0
Okt 15 13:38:21 rakete zed[19383]: eid=47 class=io pool='zstore' size=4096 offset=3185920499712 priority=3 err=5 flags=0x204080 bookmark=36123:284249:0:0
Okt 15 13:38:21 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K4JKN36U error=5 type=2 offset=3185924694016 size=4096 flags=3178625
Okt 15 13:38:21 rakete zed[19387]: eid=48 class=data pool='zstore' priority=3 err=5 flags=0xc001 bookmark=36123:284249:0:0
Also affected is my external backup pool, which is a RAID1 hosted in a QNAP external USB enclosure.
I did another test today and could reproduce the data corruption immediately. Interestingly, it only happened on the RAID10 pool "zstore", while my tests with the single-drive pool "ztank" finished without error. Same as in my previous post: the other affected pool mentioned before ('zsea') is also RAID1.
I have captured zevents before rebooting: zpool-events-v-2.3.0.txt. In it you can see all my pools (zM2, zHome, zstore and ztank). Only zstore and ztank were tested with write operations (zfs send from zM2 to zstore or ztank).
The outcome was:
pool: zstore
state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
scan: resilvered 3.37G in 00:00:39 with 0 errors on Wed Oct 16 07:59:05 2024
config:
NAME STATE READ WRITE CKSUM
zstore DEGRADED 0 0 0
mirror-0 DEGRADED 0 14 0
WD40EFRX-68N32N0-WCC7K4JKN36U ONLINE 0 16 2
WD40EFRX-68N32N0-WCC7K6YX6SAA DEGRADED 0 15 0 too many errors
mirror-1 DEGRADED 0 9 0
WD40EFRX-68N32N0-WCC7K2YAJA69 DEGRADED 0 22 13 too many errors
WD40EFRX-68N32N0-WCC7K6XALFKX ONLINE 0 23 0
cache
ata-Verbatim_Vi550_S3_493504108895630 ONLINE 0 0 0
errors: No known data errors
The journal is full of zio errors:
...
Okt 16 07:58:02 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K2YAJA69 error=5 type=2 offset=3646354886656 size=4096 flags=3145856
Okt 16 07:58:02 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K6XALFKX error=5 type=2 offset=3646354886656 size=4096 flags=3145856
Okt 16 07:58:02 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K6XALFKX error=5 type=2 offset=3646354886656 size=4096 flags=3178625
Okt 16 07:58:02 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K2YAJA69 error=5 type=2 offset=3646354886656 size=4096 flags=3178625
...
> Yes, it's only with 4K LUKS formatted devices. I could not repro with 512 byte LUKS formatted devices. How curious.
> With newer LUKS tooling and OS installers defaulting to making LUKS volumes LUKS2, and making LUKS sector sizes the same as advanced format disks (minimum 4K), this is going to begin biting real hard on future ZFS users (as well as current ZFS users who buy new drives).
Could you reproduce it on a single-disk pool, or is it all mirrors on your side?
> I re-tested with zfs 2.2.6 quickly just to make sure, and I'm not able to reproduce the issue there with 4K sectors (on a freshly created pool, as with all tests I did), neither with vdev_disk_classic 1 nor with 0.
That's helpful to know and actually a bit surprising. Could someone use git bisect to determine where this was introduced on the master branch? I'd suggest starting with commit c6be6ce1755a3d9a3cbe70256cd8958ef83d8542, which introduced the new vdev disk code.
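A bisect along those lines might look roughly like this (a sketch; the endpoints are only starting guesses until a known-good commit is confirmed):
git clone https://github.com/openzfs/zfs && cd zfs
git bisect start
git bisect bad ca0141f325ec706d38a06f9aeb8e5eb6c6a8d09a     # commit reported bad at the top of this issue
git bisect good c6be6ce1755a3d9a3cbe70256cd8958ef83d8542    # suggested starting point (new vdev disk code); adjust if it also reproduces
# At each step: build, load the modules, run a reproducer, then
# mark the result with `git bisect good` or `git bisect bad`.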
I've been able to reproduce 100% on Fedora 40 with master (38a04f0a7c17885385187ce02aa7595aef6f3d4a) using this:
#!/bin/bash
# We assume this script is being run from inside a ZFS source dir
VDEV=/tmp/file
truncate -s 2G $VDEV
PASS="fdsjfosdijfsdkjsldfjdlk"
echo "setting up LUKS"
echo $PASS | sudo cryptsetup luksFormat --type luks2 $VDEV
echo $PASS | sudo cryptsetup luksOpen $VDEV tankfile
sudo ./zpool create tank /dev/mapper/tankfile
# Copy ZFS source to /tank to generate writes
sudo cp -a . /tank
sudo ./zpool status -v
sudo ./zpool destroy tank
sudo cryptsetup luksClose /dev/mapper/tankfile
sudo rm -f $VDEV
~/zfs$ sudo ./reproduce.sh
setting up LUKS
pool: tank
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
tankfile ONLINE 0 8 0
errors: Permanent errors have been detected in the following files:
/tank/module/icp/api/kcf_cipher.c
/tank/man/man8/zfs-snapshot.8
/tank/module/Makefile.in
/tank/module/icp/asm-x86_64/aes/aes_aesni.o
/tank/module/icp/asm-x86_64/aes/libicp_la-aes_aesni.o
/tank/lib/libzpool/Makefile.am
/tank/include/os/linux/spl/sys/cred.h
NOTE: I was able to reproduce these errors with zfs_vdev_disk_classic set to either 0 or 1!
Next step is to bisect it down.
I just tried with c6be6ce and get pretty much the same behavior. With vdev_disk_classic=0, the error counts appear to be even slightly higher compared to 2.3.0rc1 for me, but that could also be randomness. I don't know if I'm doing something fundamentally wrong; it would be nice if someone else could verify.
Having said that: in the discussion around issue #15533, it was mentioned that reverting commit bd7a02c would be a possible solution, or alternatively the "rewrite of the BIO filling machinery". Taking a look at zio.c, am I seeing it correctly that the mentioned commit was, and still is, reverted in the 2.2 branch, but not in master? I have not tried reverting that commit in master or 2.3.0rc yet, but maybe that would be interesting.
Edit: As a matter of fact, @Rudd-O already pointed out in the linked issue back in January exactly what I just said about the revert not being in master.
I just tried my reproducer on AlmaLinux 9 and was able to hit the errors, but it took a few more runs than normal.
I was reading issue https://github.com/openzfs/zfs/issues/15533, mentioned by @ochilan. It seems to be very much related to this one here.
Interestingly it mentions:
> Also, while looking for a reliable way to reproduce this, I noticed that plain old writes, like:
> sudo dd if=/dev/urandom of=/test/test.bin bs=1M count=102400 status=progress
> weren't sufficient. That never triggered the WRITE errors. However, I had a 100% reproducibility success rate by receiving zfs snapshots.
This is exactly what I am experiencing. rsync does not reproduce the issue for me, but receiving a snapshot triggers the errors almost immediately.
As others have mentioned, bd7a02c251d8c119937e847d5161b512913667e6 appears to be the issue. Using AlmaLinux 9, I checked out bd7a02c251d8c119937e847d5161b512913667e6 and the reproducer hit immediately. I then checked out the commit before that (e82e68400a5be0e82d10044d4f70ebaed1547cb9) and ran the reproducer 10 times without hitting any errors.
Since I wasn't able to quickly and automatically revert bd7a02c in master due to huge changes since then (shouldn't be too difficult, but my brain is almost in shutdown mode for today), I at least tried commit c6be6ce plus the revert, and I don't get errors so far. Even though the new vdev_disk stuff improves the situation a lot, it seems like the issues with that commit are still not completely mitigated.
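For the record, that combination can be built in-tree roughly like this (a sketch of the standard in-tree build steps, not a verified recipe):
# Check out the commit that introduced the new vdev disk code,
# revert the suspect commit on top of it, then build and load in-tree.
git checkout c6be6ce1755a3d9a3cbe70256cd8958ef83d8542
git revert bd7a02c251d8c119937e847d5161b512913667e6
./autogen.sh && ./configure --enable-debug && make -s -j"$(nproc)"
sudo ./scripts/zfs.sh    # load the freshly built modules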
Note that we did revert https://github.com/openzfs/zfs/commit/bd7a02c251d8c119937e847d5161b512913667e6 in zfs-2.2.2 with 89fcb8c6f. Unfortunately that revert doesn't apply cleanly to master. Pinging @amotin
@tonyhutter Do you see anything wrong with that patch, other than that it makes something else misbehave? If you have some reproduction, then we should try to collect maximum details about those failed ZIOs, checking whether something is misaligned there. Can we dump maximum details, including page offsets, lengths, etc.?
It also looks suspicious to me that the problem is reproducible only with LUKS, but not on any actual hardware. If, as reported, it happens only on LUKS with 4KB sectors, does ZFS properly set ashift=12 there?
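A quick way to answer both questions on an affected system (a sketch; 'tankfile' and 'tank' are the names from the reproducer earlier in the thread):
# What sector size is the dm-crypt mapping exposing?
sudo cryptsetup status tankfile | grep 'sector size'
sudo blockdev --getss --getpbsz /dev/mapper/tankfile
# And what ashift did ZFS actually use?
zpool get ashift tank
sudo zdb -C tank | grep ashift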
I updated my reproducer to vary the write size and the write offset. I also manually set ashift=12:
#!/bin/bash
# We assume this script is being run from inside a ZFS source dir
VDEV=/tmp/file
truncate -s 100M $VDEV
PASS="fdsjfosdijfsdkjsldfjdlk"
echo "setting up LUKS"
echo $PASS | sudo cryptsetup luksFormat --type luks2 $VDEV
echo $PASS | sudo cryptsetup luksOpen $VDEV tankfile
sudo ./zpool create -o ashift=12 tank /dev/mapper/tankfile
CPUS=$(nproc)
for SIZE in {1..100} ; do
    for OFF in {1..100} ; do
        for i in $(seq 1 $CPUS) ; do   # note: {1..$CPUS} would not expand in bash
            dd if=/dev/urandom of=/tank/file$i-bs$SIZE-off$OFF seek=$OFF bs=$SIZE count=1 &>/dev/null &
        done
        wait
        errors=$(sudo ./zpool status -j --json-int --json-flat-vdevs | jq '.pools[].vdevs.tankfile.write_errors')
        if [ "$errors" != "$old_errors" ] && [ "$errors" -gt 0 ] ; then
            echo "ERROR: errors=$errors, Wrote $SIZE bytes to offset $(($OFF * $SIZE))"
        fi
        old_errors=$errors
    done
    sudo rm -fr /tank/*
done
sudo ./zpool status -v
sudo ./zpool destroy tank
sudo cryptsetup luksClose /dev/mapper/tankfile
sudo rm -f $VDEV
It seems some write sizes and offsets will strongly trigger the errors and others will not. Examples:
ERROR: errors=3, Wrote 75 bytes to offset 5100
ERROR: errors=4, Wrote 77 bytes to offset 5159
ERROR: errors=7, Wrote 79 bytes to offset 4898
ERROR: errors=11, Wrote 81 bytes to offset 4050
ERROR: errors=15, Wrote 84 bytes to offset 6552
ERROR: errors=20, Wrote 86 bytes to offset 5332
ERROR: errors=29, Wrote 89 bytes to offset 7120
ERROR: errors=33, Wrote 91 bytes to offset 5460
ERROR: errors=39, Wrote 94 bytes to offset 7144
ERROR: errors=40, Wrote 97 bytes to offset 7663
ERROR: errors=3, Wrote 73 bytes to offset 5767
ERROR: errors=5, Wrote 74 bytes to offset 6660
ERROR: errors=6, Wrote 75 bytes to offset 7200
ERROR: errors=8, Wrote 85 bytes to offset 5695
ERROR: errors=11, Wrote 86 bytes to offset 5848
ERROR: errors=12, Wrote 87 bytes to offset 5568
ERROR: errors=15, Wrote 90 bytes to offset 5310
ERROR: errors=17, Wrote 91 bytes to offset 5551
ERROR: errors=2, Wrote 70 bytes to offset 5810
ERROR: errors=6, Wrote 78 bytes to offset 6162
ERROR: errors=14, Wrote 79 bytes to offset 6715
ERROR: errors=16, Wrote 80 bytes to offset 6960
ERROR: errors=17, Wrote 83 bytes to offset 7138
ERROR: errors=18, Wrote 87 bytes to offset 6612
ERROR: errors=19, Wrote 88 bytes to offset 6512
ERROR: errors=24, Wrote 89 bytes to offset 6052
ERROR: errors=26, Wrote 90 bytes to offset 6210
ERROR: errors=27, Wrote 91 bytes to offset 6006
ERROR: errors=28, Wrote 92 bytes to offset 5796
ERROR: errors=30, Wrote 94 bytes to offset 4982
I'm sorry, Tony, but how do you expect to see a correlation between user file writes and ZFS disk writes, considering the latter are issued asynchronously by sync threads? If nothing else, I would insert a zpool sync somewhere in between the user writes and the error check.
@amotin I'm just reporting what I'm seeing. I found it interesting the reproducer script would generate a lot of write errors in some size/offset ranges, and none in other size/offset ranges. I don't have any insights beyond that.
Ok. Thanks. Just without explicit txg commits, and with delays between file writes and disk writes, how do we know it is not a coincidence? I'll try to play with it, but @robn could be better.
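A minimal tweak to the reproducer above in that direction (an untested sketch) would be to force a txg commit after each batch before sampling the error counter:
# ...inside the inner loop, right after the dd jobs:
wait
sudo ./zpool sync tank     # force the pending txg to disk so errors are attributable to this batch
errors=$(sudo ./zpool status -j --json-int --json-flat-vdevs | jq '.pools[].vdevs.tankfile.write_errors')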
Oh, I'd missed this thread!
A reproducer, my stars. Well done!
The quick summary of everything I think I know from #15646 and #15533:
- it has existed in some form since at least the 2.1 series
- has only been seen in conjunction with LUKS2
- has always involved some sort of transition between 512b and 4K sectors
But, I'd never been able to reproduce it myself, not for lack of trying!
From debug traces gathered by helpful folks willing to run experiments, I did some code reads, trying to follow the IO path down into the kernel, but never found anything obvious. My best guess is that there was something wrong with how we were filling in certain fields in the bio, possibly bi_sectors. Most of the block IO layer in Linux doesn't really care about these, but dm-crypt will mix many of these fields into the XTS IV. So if we get them wrong, we'd get decrypt failures.
I couldn't think of a good way to test this without being able to reproduce it myself, and it's hard to describe to people how to run blktrace or even bpftrace safely from afar, especially when I'm still just trying to gather information and I'm not sure if the bug is inducing further damage.
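For anyone willing to help gather data, a minimal capture might look like this (a sketch; observation only, but run it at your own discretion on a pool that is already throwing errors):
# Record every write submitted to the dm-crypt device (offset, size, status)
# so the shape of the failing 4K IOs can be inspected offline.
sudo blktrace -a write -d /dev/mapper/tankfile -o - | blkparse -i - > dmcrypt-writes.txt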
But anyway! A reproducer! So exciting! And by the looks of it, it does not run counter to what I've seen so far. And I certainly had not considered a race, which would explain it being hard to pin down: different pool structures, disk types, dm layers, kernel versions, and OpenZFS versions could all affect the timing.
Of course it might be unrelated but it for sure has a similar smell!
Don't mind me, I'm just excited! I've got some time this afternoon so I'll see if I can reproduce it on the testbed, and go from there.
I did indeed spend the afternoon on it, and the evening, and now it's 2am and I really must go to sleep. Tomorrow (today :grimacing:) I travel home from my vacation, which is not going to be amazing when I'm this groggy.
Good news though. @tonyhutter's repro script is working nicely, and I made it a lot more aggressive for my own testing, and now it starts raining errors pretty much as soon as the loop starts.
Better news though. I think I know what's going on. Short version: disk sectors split across memory pages. Again.
Background: Linux expects to be able to "split" a BIO by picking some point in the segment list, cutting the list in two, and giving one side of it to another BIO. (Ok, not quite like that, but that's the principle). So, there is an expectation that when split, a BIO will retain the correct alignment for the target device's block size.
OpenZFS used to quite aggressively split memory pages across BIO segments, and also had a tendency to overfill BIOs during heavy write loads, requiring the kernel to split the BIO to bring it under its fill limit. For 512-byte sectors, any split is fine, as we never loaded less than 512 bytes into a single BIO segment. For 4K though, an arbitrary split can cut on a non-4K boundary. This misaligned BIO will be rejected by many drivers.
#15588 and its follow-ups tried to address this in a few ways:
- loading the largest amount of data possible into a single BIO segment
- not overfilling the BIO, instead creating additional ones and filling them separately
- inspecting a data allocation up front and, if it is misaligned, linearising it before loading it into the BIO.
There's some amount of belt-and-suspenders here: we try to make sure we create BIOs that are always safe to split, while at the same time trying to never put the kernel in a position where it has to split. This has worked well so far.
That brings us to dm-crypt. Inside, crypt_convert_block_skcipher() opens with:
/* Reject unexpected unaligned bio. */
if (unlikely(bv_in.bv_len & (cc->sector_size - 1)))
return -EIO;
I threw a bunch of logging on this, and sure enough, the segment it was looking at was < 4K. I believe this to be the entire issue.
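To make the arithmetic concrete (a trivial illustration): with a 4K crypto sector, any bvec shorter than 4K trips that check, while a full 4K bvec passes.
# cc->sector_size = 4096; a 2K bvec left over from a page-spanning buffer:
echo $(( 2048 & (4096 - 1) ))   # 2048 (non-zero) -> dm-crypt returns -EIO
# a properly aligned full-sector bvec:
echo $(( 4096 & (4096 - 1) ))   # 0 -> accepted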
There is one place where we don't load full pages into a BIO: when the data is smaller than the logical block size (say, 4K) and it's already split across two pages of (say) 2K each. This seemed reasonable to me at the time, because it's impossible for a BIO's maximum segment count to be less than 16 (iirc), so it seemed there would never be a case where the kernel would have to split a two-segment BIO.
Turns out dm-crypt throws a spanner in this. It has pretty exacting requirements about the shape of the data it receives, mostly because size and offset are part of the math. So it will split or remap a BIO to get things into the shape it wants. Except, it seems it never expects an LBS worth of data to span two pages, so it rejects it.
The big "fix" is probably to always allocate IO memory out of whole LBS-sized chunks, never the smaller packed allocations that we sometimes use. The more targeted one, I believe, is going to be to detect when we're about to load less than an LBS-sized chunk into a BIO, and instead, get an LBS-sized allocation directly from the page cache, copy the data in, copy the rest from the next page, and submit that page to the BIO instead. Sort of like covering visible seams with a patch.
Theory untested, but it feels right. I'll sleep on it and take a swing at it tomorrow on the train.