Today's update to ca0141f325ec706d38a06f9aeb8e5eb6c6a8d09a (almost identical to current 2.3.0 RC) caused permanent pool corruption
One of my machines (FC40) recently received a kernel update
- from kernel-6.8.11-300.fc40.x86_64 to kernel-6.10.12-200.fc40.x86_64, and
- the ZFS+DKMS+Dracut master lineage, from commit 02c5aa9b092818785ed8db4e2246a828278138e3 to ca0141f325ec706d38a06f9aeb8e5eb6c6a8d09a,
simultaneously. This took place earlier today. The pool was healthy, in use, and had recently been scrubbed multiple times. No errors anywhere, neither in the kernel log nor in the journal.
Mere minutes after I rebooted to the new kernel and ZFS, my Prometheus setup alerted me to 30 checksum errors, several write errors, and 4 data errors. Upon inspection:
[root@penny ~]# zpool status -v
pool: chest
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
...abbreviated...
NAME STATE READ WRITE CKSUM
...abbreviated...
errors: Permanent errors have been detected in the following files:
<metadata>:<0x16>
<metadata>:<0x3c>
<metadata>:<0x44>
<metadata>:<0x594>
The kernel ring buffer likewise showed no hardware errors.
I rebooted to the older kernel and ZFS module, and started a scrub. It's still ongoing, but so far it has found no problems and produced no WRITE or CKSUM errors.
Interestingly, neither the cache nor the ZIL devices had that error. The boot drive also seemed unaffected.
This, to me, indicates a software issue, probably related to the LUKS write path (we've had these before) or mirroring (only the pool with mirrored drives was hit; the single boot drive, otherwise set up the same, was not).
The affected LUKS2 devices are all whole-disk formatted with a 4K sector size, and my pool is ashift 12. (The unaffected root pool, as well as the cache and ZIL devices of the affected pool, are not formatted with a 4K LUKS sector size.)
In the interest of seeing if it makes a difference, the affected LUKS devices are tuned with the following persistent flags:
Flags: allow-discards same-cpu-crypt submit-from-crypt-cpus no-read-workqueue no-write-workqueue
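For reference, flags like these can be applied persistently with cryptsetup's --allow-discards / --perf-* options. A minimal sketch, assuming a placeholder mapping name luks-chest-a and device /dev/sda (not the actual devices from this pool):
# Sketch: re-activate an open LUKS2 mapping with the performance flags
# above and store them in the LUKS2 header so they persist across boots.
sudo cryptsetup refresh luks-chest-a \
  --allow-discards \
  --perf-same_cpu_crypt \
  --perf-submit_from_crypt_cpus \
  --perf-no_read_workqueue \
  --perf-no_write_workqueue \
  --persistent
# Verify the persistent flags recorded in the header:
sudo cryptsetup luksDump /dev/sda | grep -A1 Flags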
The pool has the following properties and features:
NAME PROPERTY VALUE SOURCE
chest size 10.9T -
chest capacity 67% -
chest altroot - default
chest health ONLINE -
chest guid 2537396116593781450 -
chest version - default
chest bootfs - default
chest delegation on default
chest autoreplace off default
chest cachefile - default
chest failmode wait default
chest listsnapshots off default
chest autoexpand off default
chest dedupratio 1.00x -
chest free 3.55T -
chest allocated 7.34T -
chest readonly off -
chest ashift 12 local
chest comment - default
chest expandsize - -
chest freeing 0 -
chest fragmentation 9% -
chest leaked 0 -
chest multihost off default
chest checkpoint - -
chest load_guid 16604087848420727134 -
chest autotrim off default
chest compatibility off default
chest bcloneused 0 -
chest bclonesaved 0 -
chest bcloneratio 1.00x -
chest feature@async_destroy enabled local
chest feature@empty_bpobj active local
chest feature@lz4_compress active local
chest feature@multi_vdev_crash_dump enabled local
chest feature@spacemap_histogram active local
chest feature@enabled_txg active local
chest feature@hole_birth active local
chest feature@extensible_dataset active local
chest feature@embedded_data active local
chest feature@bookmarks enabled local
chest feature@filesystem_limits enabled local
chest feature@large_blocks enabled local
chest feature@large_dnode enabled local
chest feature@sha512 enabled local
chest feature@skein enabled local
chest feature@edonr enabled local
chest feature@userobj_accounting active local
chest feature@encryption enabled local
chest feature@project_quota active local
chest feature@device_removal enabled local
chest feature@obsolete_counts enabled local
chest feature@zpool_checkpoint enabled local
chest feature@spacemap_v2 active local
chest feature@allocation_classes enabled local
chest feature@resilver_defer enabled local
chest feature@bookmark_v2 enabled local
chest feature@redaction_bookmarks enabled local
chest feature@redacted_datasets enabled local
chest feature@bookmark_written enabled local
chest feature@log_spacemap active local
chest feature@livelist enabled local
chest feature@device_rebuild enabled local
chest feature@zstd_compress enabled local
chest feature@draid enabled local
chest feature@zilsaxattr disabled local
chest feature@head_errlog disabled local
chest feature@blake3 disabled local
chest feature@block_cloning disabled local
chest feature@vdev_zaps_v2 disabled local
chest feature@redaction_list_spill disabled local
chest feature@raidz_expansion disabled local
This is where my pool currently sits:
[root@penny ~]# zpool status
pool: chest
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub in progress since Wed Oct 9 22:37:54 2024
900G / 7.35T scanned at 950M/s, 136G / 7.35T issued at 143M/s
0B repaired, 1.80% done, 14:39:54 to go
config:
NAME STATE READ WRITE CKSUM
chest ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
dm-uuid-CRYPT-LUKS2-sda ONLINE 0 0 0
dm-uuid-CRYPT-LUKS2-sdb ONLINE 0 0 0
mirror-3 ONLINE 0 0 0
dm-uuid-CRYPT-LUKS2-sdc ONLINE 0 0 0
dm-uuid-CRYPT-LUKS2-sdd ONLINE 0 0 0
logs
dm-uuid-CRYPT-LUKS2-sde ONLINE 0 0 0
cache
dm-uuid-CRYPT-LUKS2-sdf ONLINE 0 0 0
errors: 4 data errors, use '-v' for a list
Update: good news! After reverting to the prior kernel + commit mentioned above, I am very happy to report that the scrub found no errors, and the data errors listed previously simply disappeared. So not a single bit of data loss!
The less good news: this strongly indicates that, under this configuration, there is a software defect in OpenZFS.
I see you listed pool status for pool chest but pool props for pool penny. Just making sure you've got the right copypasta here.
Next time it happens, it'd be good to see any errors in the kernel log (that is, the zio ... lines) and any related entries from drivers around the same time, as well as any relevant entries from zpool events -v (these are lost on reboot, so grab those first).
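For example, something along these lines before rebooting (a sketch; adjust paths as needed):
# Save the in-memory ZFS event log (lost on reboot) and any zio error
# lines from the kernel log for later analysis.
zpool events -v > /root/zpool-events-$(hostname)-$(date +%F).txt
journalctl -k -b | grep 'zio pool=' > /root/zio-errors-$(date +%F).txt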
Good catch, let me fix the comment above. In a minute.
Updated.
Any chance you have a test system available which could be used to help us bisect where this was introduced? I'd suggest starting by testing with https://github.com/openzfs/zfs/commit/ca0141f325ec706d38a06f9aeb8e5eb6c6a8d09a and the kernel-6.8.11-300.fc40.x86_64 kernel.
Hm! I have a laptop with ZFS atop LUKS and I think it has mirroring set up. It was one of the systems I was ready to upgrade, although that kernel might not boot on Xen (which is what the laptop boots first). Let me get back to you soonish with more info.
Tried rc1 with the proposed rc2 patchset on root-on-ZFS-on-LUKS with kernel 6.10. I got 800 write errors, with lines like this in the log:
zio pool=kc3000 vdev=/dev/mapper/root error=5 type=2 offset=25950006681088 size=4096 flags=317825
The SSD presents 4K sectors and LUKS2 also uses 4K sectors. Looks very much like #14533, with discussion ongoing in #15646, except that it's now failing on metadata writes too.
edit: after rebooting with zfs-2.2.6 and running a scrub all errors disappeared, so I guess there was at least one working copy of all metadata.
edit2: after another restart even 2.2.6 generates write errors, the scrub ends in 2 seconds, and there's a permanent error in kc3000/ROOT/gentoo:<0xfffffffffffffffe>. Seems like something actually broke permanently.
I was able to reproduce the write errors pretty easily on Arch Linux with kernels 6.6.56 (LTS) and 6.11.3 with zfs 2.3.0 rc1 (commit 3a9fca901b) even on a single, non-mirrored LUKS device. I wasn't able to reproduce it without LUKS so far. I used default settings for everything (cryptsetup, zpool) with a freshly created pool; the LUKS device automatically decided to use 4096 byte blocks. I then rsynced my home directory to the pool and observed "zpool status" where write errors started to appear after a few seconds.
So, I tried to find a commit which worked at some point on the master branch, but I wasn't able to find a good commit so far. The 2.2 branch does work for me, but it has diverged a lot from master in both directions (ahead/behind).
I have a question: Did the fixes that were performed on 2.2 for issues like #15533 or #15646 find their way back into master and thus into 2.3? If so, I would be interested in a pointer to which commit on master to start from when trying to find a later commit that breaks, relative to that point in time.
> Did the fixes that were performed on 2.2 for issues like https://github.com/openzfs/zfs/issues/15533 or https://github.com/openzfs/zfs/issues/15646 find their way back into master and thus into 2.3?
Yes, all of the work for these issues was originally done in master and then backported to the 2.2 releases. So both 2.2 and 2.3 include these changes. Where things do differ is that 2.2 releases default to using the classic disk IO code, while 2.3 defaults to the new method implemented in #15588. Would you mind testing with both zfs_vdev_disk_classic=1 (2.2 default) and zfs_vdev_disk_classic=0 (2.3 default) module options? That may help us narrow it down.
zfs_vdev_disk_classic=0|1 (uint)
If set to 1, OpenZFS will submit IO to Linux using the method it used
in 2.2 and earlier. This "classic" method has known issues with highly
fragmented IO requests and is slower on many workloads, but it has been
in use for many years and is known to be very stable. If you set this
parameter, please also open a bug report why you did so, including the
workload involved and any error messages.
This parameter and the classic submission method will be removed once
we have total confidence in the new method.
This parameter only applies on Linux, and can only be set at module
load time.
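Since it's load-time only, flipping it looks roughly like this (a sketch using the usual modprobe.d/dracut conventions; rebuild the initramfs if the zfs module is loaded from it):
# Test with the classic 2.2-style submission path:
echo 'options zfs zfs_vdev_disk_classic=1' | sudo tee /etc/modprobe.d/zfs-classic.conf
sudo dracut -f        # Fedora/dracut: regenerate the initramfs so the option applies at early boot
sudo reboot
# After reboot, confirm which path is active:
cat /sys/module/zfs/parameters/zfs_vdev_disk_classic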
Thank you for your reply. I tested some combinations with zfs 2.3.0 rc1, and zfs_vdev_disk_classic does indeed make a difference in the number of errors. Also, I was only able to get errors when using LUKS with 4k sectors, not with 512-byte sectors:
| kernel | classic | sector size | errors |
|---|---|---|---|
| 6.6.56 | 0 | 4k | low |
| 6.6.56 | 1 | 4k | high |
| 6.6.56 | 0 | 512 | none |
| 6.6.56 | 1 | 512 | none |
| 6.1.122 | 0 | 4k | low |
where "low" errors means single-digit write errors in about 30 seconds, "high" means thousands of errors.
Yes, it's only with 4K LUKS formatted devices. I could not repro with 512 byte LUKS formatted devices. How curious.
With newer LUKS tooling and OS installers defaulting to making LUKS volumes LUKS2, and making LUKS sector sizes the same as advanced format disks (minimum 4K), this is going to begin biting real hard on future ZFS users (as well as current ZFS users who buy new drives).
I re-tested with zfs 2.2.6 quickly just to make sure, and I'm not able to reproduce the issue there with 4K sectors (on a freshly created pool, as with all tests I did), neither with vdev_disk_classic 1 nor with 0.
I tested 2.3.0-rc2 with kernel 6.11.3 today on EndeavourOS and got data corruption too. All devices in my pools are LUKS2 devices (sector: 4096 [bytes]). After a few rsync and zfs send/receive operations I got this message from zed:
ZFS has finished a resilver:
eid: 406
class: resilver_finish
host: rakete
time: 2024-10-15 14:13:52+0200
pool: zstore
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub repaired 16K in 00:00:11 with 0 errors on Tue Oct 15 14:13:52 2024
config:
NAME STATE READ WRITE CKSUM
zstore ONLINE 0 0 0
mirror-0 ONLINE 0 8 0
WD40EFRX-68N32N0-WCC7K4JKN36U ONLINE 0 9 1
WD40EFRX-68N32N0-WCC7K6YX6SAA ONLINE 0 8 0
mirror-1 ONLINE 0 4 0
WD40EFRX-68N32N0-WCC7K2YAJA69 ONLINE 0 8 3
WD40EFRX-68N32N0-WCC7K6XALFKX ONLINE 0 4 0
cache
ata-Verbatim_Vi550_S3_493504108895630 ONLINE 0 0 0
errors: 1 data errors, use '-v' for a list
And in the journal I see plenty of this:
Okt 15 13:38:15 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K2YAJA69 error=5 type=2 offset=3449457299456 size=4096 flags=3145856
Okt 15 13:38:15 rakete zed[19356]: eid=44 class=io pool='zstore' vdev=WD40EFRX-68N32N0-WCC7K2YAJA69 size=4096 offset=3449457299456 priority=3 err=5 flags=0x300080 bookmark=36123:267484:0:0
Okt 15 13:38:21 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K6YX6SAA error=5 type=2 offset=3185924694016 size=4096 flags=3145856
Okt 15 13:38:21 rakete zed[19377]: eid=45 class=io pool='zstore' vdev=WD40EFRX-68N32N0-WCC7K6YX6SAA size=4096 offset=3185924694016 priority=3 err=5 flags=0x300080 bookmark=36123:284249:0:0
Okt 15 13:38:21 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K4JKN36U error=5 type=2 offset=3185924694016 size=4096 flags=3145856
Okt 15 13:38:21 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K6YX6SAA error=5 type=2 offset=3185924694016 size=4096 flags=3178625
Okt 15 13:38:21 rakete zed[19382]: eid=46 class=io pool='zstore' vdev=WD40EFRX-68N32N0-WCC7K4JKN36U size=4096 offset=3185924694016 priority=3 err=5 flags=0x300080 bookmark=36123:284249:0:0
Okt 15 13:38:21 rakete zed[19383]: eid=47 class=io pool='zstore' size=4096 offset=3185920499712 priority=3 err=5 flags=0x204080 bookmark=36123:284249:0:0
Okt 15 13:38:21 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K4JKN36U error=5 type=2 offset=3185924694016 size=4096 flags=3178625
Okt 15 13:38:21 rakete zed[19387]: eid=48 class=data pool='zstore' priority=3 err=5 flags=0xc001 bookmark=36123:284249:0:0
Also affected is my external backup pool, which is a RAID1 hosted in a QNAP external USB enclosure.
I did another test today and could reproduce the data corruption immediately. Interestingly, it only happened on the RAID10 pool "zstore", while my tests with the single-drive pool "ztank" finished without error. Same as in my previous post: the other affected pool mentioned before ('zsea') is also RAID1.
I have captured zevents before rebooting: zpool-events-v-2.3.0.txt. In it you can see all my pools (zM2, zHome, zstore and ztank). Only zstore and ztank were tested with write operations (zfs send from zM2 to zstore or ztank).
The outcome was:
pool: zstore
state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
scan: resilvered 3.37G in 00:00:39 with 0 errors on Wed Oct 16 07:59:05 2024
config:
NAME STATE READ WRITE CKSUM
zstore DEGRADED 0 0 0
mirror-0 DEGRADED 0 14 0
WD40EFRX-68N32N0-WCC7K4JKN36U ONLINE 0 16 2
WD40EFRX-68N32N0-WCC7K6YX6SAA DEGRADED 0 15 0 too many errors
mirror-1 DEGRADED 0 9 0
WD40EFRX-68N32N0-WCC7K2YAJA69 DEGRADED 0 22 13 too many errors
WD40EFRX-68N32N0-WCC7K6XALFKX ONLINE 0 23 0
cache
ata-Verbatim_Vi550_S3_493504108895630 ONLINE 0 0 0
errors: No known data errors
The journal is full of zio errors:
...
Okt 16 07:58:02 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K2YAJA69 error=5 type=2 offset=3646354886656 size=4096 flags=3145856
Okt 16 07:58:02 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K6XALFKX error=5 type=2 offset=3646354886656 size=4096 flags=3145856
Okt 16 07:58:02 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K6XALFKX error=5 type=2 offset=3646354886656 size=4096 flags=3178625
Okt 16 07:58:02 rakete kernel: zio pool=zstore vdev=/dev/mapper/WD40EFRX-68N32N0-WCC7K2YAJA69 error=5 type=2 offset=3646354886656 size=4096 flags=3178625
...
> Yes, it's only with 4K LUKS formatted devices. I could not repro with 512 byte LUKS formatted devices. How curious.
> With newer LUKS tooling and OS installers defaulting to making LUKS volumes LUKS2, and making LUKS sector sizes the same as advanced format disks (minimum 4K), this is going to begin biting real hard on future ZFS users (as well as current ZFS users who buy new drives).
Could you reproduce it on a single-disk pool, or is it all mirrors on your side?
> I re-tested with zfs 2.2.6 quickly just to make sure, and I'm not able to reproduce the issue there with 4K sectors (on a freshly created pool, as with all tests I did), neither with vdev_disk_classic 1 nor with 0.
That's helpful to know and actually a bit surprising. Could someone use git bisect to determine where this was introduced on the master branch? I'd suggest starting with commit c6be6ce1755a3d9a3cbe70256cd8958ef83d8542, which introduced the new vdev disk code.
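A bisect along those lines might look roughly like this (a sketch; the endpoints are only starting guesses until a known-good commit is confirmed):
git clone https://github.com/openzfs/zfs && cd zfs
git bisect start
git bisect bad ca0141f325ec706d38a06f9aeb8e5eb6c6a8d09a     # commit reported bad at the top of this issue
git bisect good c6be6ce1755a3d9a3cbe70256cd8958ef83d8542    # suggested starting point (new vdev disk code); adjust if it also reproduces
# At each step: build, load the modules, run a reproducer, then
# mark the result with `git bisect good` or `git bisect bad`.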
I've been able to reproduce 100% on Fedora 40 with master (38a04f0a7c17885385187ce02aa7595aef6f3d4a) using this:
#!/bin/bash
# We assume this script is being run from inside a ZFS source dir
VDEV=/tmp/file
truncate -s 2G $VDEV
PASS="fdsjfosdijfsdkjsldfjdlk"
echo "setting up LUKS"
echo $PASS | sudo cryptsetup luksFormat --type luks2 $VDEV
echo $PASS | sudo cryptsetup luksOpen $VDEV tankfile
sudo ./zpool create tank /dev/mapper/tankfile
# Copy ZFS source to /tank to generate writes
sudo cp -a . /tank
sudo ./zpool status -v
sudo ./zpool destroy tank
sudo cryptsetup luksClose /dev/mapper/tankfile
sudo rm -f $VDEV
~/zfs$ sudo ./reproduce.sh
setting up LUKS
pool: tank
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
tankfile ONLINE 0 8 0
errors: Permanent errors have been detected in the following files:
/tank/module/icp/api/kcf_cipher.c
/tank/man/man8/zfs-snapshot.8
/tank/module/Makefile.in
/tank/module/icp/asm-x86_64/aes/aes_aesni.o
/tank/module/icp/asm-x86_64/aes/libicp_la-aes_aesni.o
/tank/lib/libzpool/Makefile.am
/tank/include/os/linux/spl/sys/cred.h
NOTE: I was able to reproduce these errors with zfs_vdev_disk_classic set to either 0 or 1!
Next step is to bisect it down.
I just tried with c6be6ce and get pretty much the same behavior. With vdev_disk_classic=0, the error counts appear to be even slightly higher compared to 2.3.0rc1 for me, but that could also be randomness. I don't know if I'm doing something fundamentally wrong; it would be nice if someone else could verify.
Having said that: in the discussion around issue #15533, it was mentioned that reverting commit bd7a02c would be a possible solution, or alternatively the "rewrite of the BIO filling machinery". Taking a look at zio.c, am I seeing it correctly that the mentioned commit was, and still is, reverted in the 2.2 branch, but not in master? I have not tried reverting that commit in master or 2.3.0rc yet, but maybe that would be interesting.
Edit: As a matter of fact, @Rudd-O already pointed out in the linked issue back in January exactly what I just said about the revert not being in master.
I just tried my reproducer on AlmaLinux 9 and was able to hit the errors, but it took a few more runs than normal.
I was reading issue https://github.com/openzfs/zfs/issues/15533, mentioned by @ochilan. It seems to be very much related to this one here.
Interestingly it mentions:
> Also, while looking for a reliable way to reproduce this, I noticed that plain old writes, like:
> sudo dd if=/dev/urandom of=/test/test.bin bs=1M count=102400 status=progress
> weren't sufficient. That never triggered the WRITE errors. However, I had a 100% reproducibility success rate by receiving zfs snapshots.
This is exactly what I am experiencing. rsync does not reproduce the issue for me, but receiving a snapshot triggers the errors almost immediately.
As others have mentioned, bd7a02c251d8c119937e847d5161b512913667e6 appears to be the issue. Using AlmaLinux 9, I checked out bd7a02c251d8c119937e847d5161b512913667e6 and the reproducer hit immediately. I then checked out the commit before that (e82e68400a5be0e82d10044d4f70ebaed1547cb9) and ran the reproducer 10 times without hitting any errors.
Since I wasn't able to quickly and automatically revert bd7a02c in master due to huge changes since then (shouldn't be too difficult, but my brain is almost in shutdown mode for today), I at least tried commit c6be6ce plus the revert, and I don't get errors so far. Even though the new vdev_disk stuff improves the situation a lot, it seems like the issues with that commit are still not completely mitigated.
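For the record, that combination can be built in-tree roughly like this (a sketch of the standard in-tree build steps, not a verified recipe):
# Check out the commit that introduced the new vdev disk code,
# revert the suspect commit on top of it, then build and load in-tree.
git checkout c6be6ce1755a3d9a3cbe70256cd8958ef83d8542
git revert bd7a02c251d8c119937e847d5161b512913667e6
./autogen.sh && ./configure --enable-debug && make -s -j"$(nproc)"
sudo ./scripts/zfs.sh    # load the freshly built modules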
Note that we did revert https://github.com/openzfs/zfs/commit/bd7a02c251d8c119937e847d5161b512913667e6 in zfs-2.2.2 with 89fcb8c6f. Unfortunately that revert doesn't apply cleanly to master. Pinging @amotin
@tonyhutter Do you see anything wrong with that patch, other than that it makes something else misbehave? If you have some reproduction, then we should try to collect maximum details about those failed ZIOs, checking whether something is misaligned there. Can we dump maximum details, including page offsets, lengths, etc.?
It also looks suspicious to me that the problem is reproducible only with LUKS, but not on any actual hardware. If, as reported, it happens only on LUKS with 4KB sectors, does ZFS properly set ashift=12 there?
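A quick way to answer both questions on an affected system (a sketch; 'tankfile' and 'tank' are the names from the reproducer earlier in the thread):
# What sector size is the dm-crypt mapping exposing?
sudo cryptsetup status tankfile | grep 'sector size'
sudo blockdev --getss --getpbsz /dev/mapper/tankfile
# And what ashift did ZFS actually use?
zpool get ashift tank
sudo zdb -C tank | grep ashift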
I updated my reproducer to vary the write size and the write offset. I also manually set ashift=12:
#!/bin/bash
# We assume this script is being run from inside a ZFS source dir
VDEV=/tmp/file
truncate -s 100M $VDEV
PASS="fdsjfosdijfsdkjsldfjdlk"
echo "setting up LUKS"
echo $PASS | sudo cryptsetup luksFormat --type luks2 $VDEV
echo $PASS | sudo cryptsetup luksOpen $VDEV tankfile
sudo ./zpool create -o ashift=12 tank /dev/mapper/tankfile
CPUS=$(nproc)
for SIZE in {1..100} ; do
    for OFF in {1..100} ; do
        for i in $(seq 1 $CPUS) ; do   # note: {1..$CPUS} would not expand in bash
            dd if=/dev/urandom of=/tank/file$i-bs$SIZE-off$OFF seek=$OFF bs=$SIZE count=1 &>/dev/null &
        done
        wait
        errors=$(sudo ./zpool status -j --json-int --json-flat-vdevs | jq '.pools[].vdevs.tankfile.write_errors')
        if [ "$errors" != "$old_errors" ] && [ "$errors" -gt 0 ] ; then
            echo "ERROR: errors=$errors, Wrote $SIZE bytes to offset $(($OFF * $SIZE))"
        fi
        old_errors=$errors
    done
    sudo rm -fr /tank/*
done
sudo ./zpool status -v
sudo ./zpool destroy tank
sudo cryptsetup luksClose /dev/mapper/tankfile
sudo rm -f $VDEV
It seems some write sizes and offsets will strongly trigger the errors and others will not. Examples:
ERROR: errors=3, Wrote 75 bytes to offset 5100
ERROR: errors=4, Wrote 77 bytes to offset 5159
ERROR: errors=7, Wrote 79 bytes to offset 4898
ERROR: errors=11, Wrote 81 bytes to offset 4050
ERROR: errors=15, Wrote 84 bytes to offset 6552
ERROR: errors=20, Wrote 86 bytes to offset 5332
ERROR: errors=29, Wrote 89 bytes to offset 7120
ERROR: errors=33, Wrote 91 bytes to offset 5460
ERROR: errors=39, Wrote 94 bytes to offset 7144
ERROR: errors=40, Wrote 97 bytes to offset 7663
ERROR: errors=3, Wrote 73 bytes to offset 5767
ERROR: errors=5, Wrote 74 bytes to offset 6660
ERROR: errors=6, Wrote 75 bytes to offset 7200
ERROR: errors=8, Wrote 85 bytes to offset 5695
ERROR: errors=11, Wrote 86 bytes to offset 5848
ERROR: errors=12, Wrote 87 bytes to offset 5568
ERROR: errors=15, Wrote 90 bytes to offset 5310
ERROR: errors=17, Wrote 91 bytes to offset 5551
ERROR: errors=2, Wrote 70 bytes to offset 5810
ERROR: errors=6, Wrote 78 bytes to offset 6162
ERROR: errors=14, Wrote 79 bytes to offset 6715
ERROR: errors=16, Wrote 80 bytes to offset 6960
ERROR: errors=17, Wrote 83 bytes to offset 7138
ERROR: errors=18, Wrote 87 bytes to offset 6612
ERROR: errors=19, Wrote 88 bytes to offset 6512
ERROR: errors=24, Wrote 89 bytes to offset 6052
ERROR: errors=26, Wrote 90 bytes to offset 6210
ERROR: errors=27, Wrote 91 bytes to offset 6006
ERROR: errors=28, Wrote 92 bytes to offset 5796
ERROR: errors=30, Wrote 94 bytes to offset 4982
I'm sorry, Tony, but how do you expect to see a correlation between user file writes and ZFS disk writes, considering the latter are issued asynchronously by sync threads? If nothing else, I would insert a zpool sync somewhere in between the user writes and the error check.
@amotin I'm just reporting what I'm seeing. I found it interesting the reproducer script would generate a lot of write errors in some size/offset ranges, and none in other size/offset ranges. I don't have any insights beyond that.
Ok. Thanks. Just without explicit txg commits, and with delays between file writes and disk writes, how do we know it is not a coincidence? I'll try to play with it, but @robn could be better.
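A minimal tweak to the reproducer above in that direction (an untested sketch) would be to force a txg commit after each batch before sampling the error counter:
# ...inside the inner loop, right after the dd jobs:
wait
sudo ./zpool sync tank     # force the pending txg to disk so errors are attributable to this batch
errors=$(sudo ./zpool status -j --json-int --json-flat-vdevs | jq '.pools[].vdevs.tankfile.write_errors')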
Oh, I'd missed this thread!
A reproducer, my stars. Well done!
The quick summary of everything I think I know from #15646 and #15533:
- it has existed in some form since at least the 2.1 series
- has only been seen in conjunction with LUKS2
- has always involved some sort of transition between 512b and 4K sectors
But, I'd never been able to reproduce it myself, not for lack of trying!
From debug traces gathered by helpful folks willing to run experiments, I did some code reads, trying to follow the IO path down into the kernel, but never found anything obvious. My best guess is that there was something wrong with how we were filling in certain fields in the bio, possibly bi_sectors. Most of the block IO layer in Linux doesn't really care about these, but dm-crypt will mix many of these fields into the XTS IV. So if we get them wrong, we'd get decrypt failures.
I couldn't think of a good way to test this without being able to reproduce it myself, and it's hard to describe to people how to run blktrace or even bpftrace safely from afar, especially when I'm still just trying to gather information and I'm not sure if the bug is inducing further damage.
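For anyone willing to help gather data, a minimal capture might look like this (a sketch; observation only, but run it at your own discretion on a pool that is already throwing errors):
# Record every write submitted to the dm-crypt device (offset, size, status)
# so the shape of the failing 4K IOs can be inspected offline.
sudo blktrace -a write -d /dev/mapper/tankfile -o - | blkparse -i - > dmcrypt-writes.txt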
But anyway! A reproducer! So exciting! And by the looks of it, it does not run counter to what I've seen so far. And I certainly had not considered a race, which would explain it being hard to pin down: different pool structures, disk types, dm layers, kernel versions, and OpenZFS versions could all affect the timing.
Of course it might be unrelated but it for sure has a similar smell!
Don't mind me, I'm just excited! I've got some time this afternoon so I'll see if I can reproduce it on the testbed, and go from there.
I did indeed spend the afternoon on it, and the evening, and now it's 2am and I really must go to sleep. Tomorrow (today :grimacing:) I travel home from my vacation, which is not going to be amazing when I'm this groggy.
Good news though. @tonyhutter's repro script is working nicely, and I made it a lot more aggressive for my own testing, and now it starts raining errors pretty much as soon as the loop starts.
Better news though. I think I know what's going on. Short version: disk sectors split across memory pages. Again.
Background: Linux expects to be able to "split" a BIO by picking some point in the segment list, cutting the list in two, and giving one side of it to another BIO. (Ok, not quite like that, but that's the principle). So, there is an expectation that when split, a BIO will retain the correct alignment for the target device's block size.
OpenZFS used to quite aggressively split memory pages across BIO segments, and also had a tendency to overfill BIOs during heavy write loads, requiring the kernel to split the BIO to bring it under its fill limit. For 512-byte sectors, any split is fine, as we never loaded less than 512 bytes into a single BIO segment. For 4K though, an arbitrary split can cut on a non-4K boundary. This misaligned BIO will be rejected by many drivers.
#15588 and its follow-ups tried to address this in a few ways:
- loading the largest amount of data possible into a single BIO segment
- not overfilling the BIO, instead creating additional ones and filling them separately
- inspecting a data allocation up front and, if it is misaligned, linearising it before loading it into the BIO.
There's some amount of belt-and-suspenders here: we try to make sure we create BIOs that are always safe to split, while at the same time trying to never put the kernel in a position where it has to split. This has worked well so far.
That brings us to dm-crypt. Inside, crypt_convert_block_skcipher() opens with:
/* Reject unexpected unaligned bio. */
if (unlikely(bv_in.bv_len & (cc->sector_size - 1)))
return -EIO;
I threw a bunch of logging on this, and sure enough, the segment it was looking at was < 4K. I believe this to be the entire issue.
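To make the arithmetic concrete (a trivial illustration): with a 4K crypto sector, any bvec shorter than 4K trips that check, while a full 4K bvec passes.
# cc->sector_size = 4096; a 2K bvec left over from a page-spanning buffer:
echo $(( 2048 & (4096 - 1) ))   # 2048 (non-zero) -> dm-crypt returns -EIO
# a properly aligned full-sector bvec:
echo $(( 4096 & (4096 - 1) ))   # 0 -> accepted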
There is one place where we don't load full pages into a BIO: when the data is smaller than the logical block size (say, 4K) and it's already split across two pages of (say) 2K each. This seemed reasonable to me at the time, because it's impossible for a BIO's maximum segment count to be less than 16 (iirc), so it seemed there would never be a case where the kernel would have to split a two-segment BIO.
Turns out dm-crypt throws a spanner in this. It has pretty exacting requirements about the shape of the data it receives, mostly because size and offset are part of the math. So it will split or remap a BIO to get things into the shape it wants. Except, it seems it never expects an LBS worth of data to span two pages, so it rejects it.
The big "fix" is probably to always allocate IO memory out of whole LBS-sized chunks, never the smaller packed allocations that we sometimes use. The more targeted one, I believe, is going to be to detect when we're about to load less than an LBS-sized chunk into a BIO, and instead, get an LBS-sized allocation directly from the page cache, copy the data in, copy the rest from the next page, and submit that page to the BIO instead. Sort of like covering visible seams with a patch.
Theory untested, but it feels right. I'll sleep on it and take a swing at it tomorrow on the train.