Frequent Checksum Errors During Scrub on ZFS Pool
System information
| Type | Version/Name |
|---|---|
| Distribution Name | Debian |
| Distribution Version | 12 |
| Kernel Version | 6.10.3-amd64 |
| Architecture | amd64 |
| OpenZFS Version | zfs-2.2.5-1 |
Describe the problem you're observing
Checksum errors on scrub - in rare cases none, in most cases 3-4 on different drives. Sometimes not even the same drives. No SMART errors logged.
Changed so far:
- RAM to ECC RAM
- new PSU
- connected 3 drives of the pool to another PCIe SATA card
- changed all cables to the drives
The problem was originally a Q&A discussion - but since I'm out of options on what could cause it, maybe it's a bug somewhere? https://github.com/openzfs/zfs/discussions/16445
Describe how to reproduce the problem
Run a scrub: zpool scrub Data
Include any warning/errors/backtraces from the system logs
root@pve:~# zpool status -v
pool: Data
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
scan: scrub in progress since Thu Aug 15 17:15:59 2024
35.6T / 84.1T scanned at 14.4G/s, 3.92T / 84.1T issued at 1.58G/s
516K repaired, 4.66% done, 14:24:32 to go
config:
NAME STATE READ WRITE CKSUM
Data ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ata-ST18000NM003D-3DL103_ZVT4WHGT ONLINE 0 0 0
ata-ST18000NM003D-3DL103_ZVT5NTBG ONLINE 0 0 0
ata-ST18000NM003D-3DL103_ZVT5MER8 ONLINE 0 0 0
ata-ST18000NM003D-3DL103_ZVT0SCR9 ONLINE 0 0 1 (repairing)
ata-ST18000NM003D-3DL103_ZVT07MLA ONLINE 0 0 0
ata-ST18000NM003D-3DL103_ZVT9JNZR ONLINE 0 0 0
ata-ST18000NM003D-3DL103_ZVT9C7NM ONLINE 0 0 1 (repairing)
ata-ST18000NM003D-3DL103_ZVTAEL6D ONLINE 0 0 1 (repairing)
I don't want to blindly change out the CPU or board, or buy another HBA controller - any ideas?
You said in the discussion you removed vfio and PCIe passthrough and still got checksum errors - that's not super surprising, depending on the origin. The real question, I think, is whether you still get more on a second scrub on the HV - since, if we hypothesize these are errors already on disk, then yes, you're going to find them on a scrub after disconnecting; what matters is whether you find more later.
I would also remark that, from memory, AMD doesn't explicitly promise all the ECC functionality works on consumer boards, that's more of a "we don't stop board vendors from enabling it after testing it, good luck", so it's not impossible that there's still ECC errors that aren't being reported properly, though I wouldn't put that as the most likely answer without more data.
My first question would be what the errors look like - e.g. is it always reporting 0s for the checksums, or something like that - which the events in zpool events -v should show. (You might need a debug build for it to report that, I never remember.)
My first suggestion would be, don't run 6.10 or 6.9 - even 2.2.5 doesn't claim 6.10 support, so let's back off to something earlier that's in the actual supported range and see if this keeps happening. I'd suggest pre-6.4 - I'm just trying to pick something that people have been using for a while without screams.
My second question would be, does this go away if you change the parity or checksum computation to use a different implementation? (e.g. given:
# grep . /sys/module/{icp,zcommon,zfs}/parameters/{zfs_fletcher_4_impl,icp_aes_impl,icp_gcm_impl,zfs_vdev_raidz_impl} 2>/dev/null
/sys/module/icp/parameters/icp_aes_impl:cycle [fastest] generic x86_64 aesni
/sys/module/icp/parameters/icp_gcm_impl:cycle [fastest] avx generic pclmulqdq
/sys/module/zcommon/parameters/zfs_fletcher_4_impl:[fastest] scalar superscalar superscalar4 sse2 ssse3 avx2
/sys/module/zfs/parameters/zfs_vdev_raidz_impl:cycle [fastest] original scalar sse2 ssse3 avx2
picking something other than "fastest" for zfs_fletcher_4_impl - like superscalar4, which does not use any SIMD instructions - and seeing if the problem stops occurring.
I should note that, by nature of doing this, it's going to be substantially slower at checksums, so your scrubs may take longer and use much more CPU time - but if we want to eliminate weird bugs like that, ...)
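For concreteness, a rough sketch of flipping that tunable at runtime (the path matches the grep output above; on builds where the modules are merged it may live under /sys/module/zfs instead):
# switch fletcher4 to the non-SIMD superscalar4 implementation for testing
echo superscalar4 > /sys/module/zcommon/parameters/zfs_fletcher_4_impl
# confirm which implementation is active (shown in [brackets])
cat /sys/module/zcommon/parameters/zfs_fletcher_4_impl
# revert to the default once done
echo fastest > /sys/module/zcommon/parameters/zfs_fletcher_4_impl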
Intel had a fun bug in Sapphire Rapids chips recently where virtualization played badly with SIMD instructions, so I wouldn't be totally astonished if AMD turned up something strange in the same vein, though I don't know of one offhand.
My third question would be, what exactly is every physical connection step between the disks and any SATA/SAS/USB controllers involved? Because the last time I personally had a weird problem with checksum errors cropping up weirdly and going up over time, it was because something in my controller setup was masking a bunch of write errors, so they only showed up as errors when I tried to actually read them back later, and, naturally, they were not what I had expected. (Moving the disks to a different machine and controller temporarily made this very clear.) That was all a USB->SATA setup, so I don't think it's the specific case here, but just to point out, getting checksum errors back can be what happens if you failed to write something, somehow.
My naive guess, since it's not reporting any uncorrectable errors at the moment, is that somehow, occasionally a write isn't getting to disk on one or two disks, and isn't properly bubbling up an error somehow, but since it's P=2, you've got enough redundancy to recover from it when you notice it later.
So, if it were me, I'd make sure I had backups of everything on there (not because of some urgent risk I haven't mentioned, just on principle before experimenting), run two or three scrubs without changing anything else outside of the VM and see if the checksum error count keeps going up each time (the first time wouldn't be surprising, it's the second and third time that are more interesting).
Capture the errors from zpool events -v that crop up when it errors and share them (if there are any filenames in them, feel free to delete them; they shouldn't matter). (Note that zpool events -v doesn't persist across reboots, so capture it before you reboot.)
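For example, something along these lines should preserve it:
zpool events -v > /root/zpool-events-$(date +%F).txt   # dump the verbose event log to a file
zpool events -f                                        # or follow new events live while the scrub runs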
If they didn't keep going up, we dig more into why the virtualization is making things odd; if they did, then we dig into why that, probably by me suggesting you do superscalar4 on fletcher_4_impl and doing a few scrubs again, which will likely take a bit.
(And if you have an older kernel than 6.9 to conveniently try, I'd try it after 2 scrubs outside of the VM but before changing any other settings - and then do a few scrubs again in that and see if it turns out differently.)
Oh right, I thought I vaguely remembered one.
And it's for Zen 2, isn't that fun. #14557
So maybe a newer microcode would be a good idea for you. (You could also test by booting with noxsaves and changing nothing else, and seeing if the problem goes away...)
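Roughly, on a GRUB-based Debian/Proxmox install (sketch only; systemd-boot setups use /etc/kernel/cmdline and proxmox-boot-tool refresh instead):
# append noxsaves to the default kernel command line and regenerate the GRUB config
sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&noxsaves /' /etc/default/grub
update-grub
reboot
# after the reboot, confirm it took effect
grep -o noxsaves /proc/cmdline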
Thank you so much for this wall of help!
I'll first do a few scrubs and see if the errors increase (still on the HV itself).
zpool events -v: (snip)
Aug 17 2024 11:26:59.489108558 ereport.fs.zfs.checksum
class = "ereport.fs.zfs.checksum"
ena = 0xa8643cdeea300401
detector = (embedded nvlist)
version = 0x0
scheme = "zfs"
pool = 0x46e14f483fbb48c0
vdev = 0x9c76b56715dbd3b
(end detector)
pool = "Data"
pool_guid = 0x46e14f483fbb48c0
pool_state = 0x0
pool_context = 0x0
pool_failmode = "wait"
vdev_guid = 0x9c76b56715dbd3b
vdev_type = "disk"
vdev_path = "/dev/disk/by-id/ata-ST18000NM003D-3DL103_ZVT0SCR9-part1"
vdev_devid = "ata-ST18000NM003D-3DL103_ZVT0SCR9-part1"
vdev_ashift = 0x9
vdev_complete_ts = 0x8a8643ae5e04
vdev_delta_ts = 0x12b736
vdev_read_errors = 0x0
vdev_write_errors = 0x0
vdev_cksum_errors = 0x2
vdev_delays = 0x0
parent_guid = 0xf8dd7a25055d838
parent_type = "raidz"
vdev_spare_paths =
vdev_spare_guids =
zio_err = 0x0
zio_flags = 0x1000b0
zio_stage = 0x400000
zio_pipeline = 0x3e00000
zio_delay = 0x0
zio_timestamp = 0x0
zio_delta = 0x0
zio_priority = 0x4
zio_offset = 0xbae41798000
zio_size = 0x2b000
zio_objset = 0x283
zio_object = 0x11cdb
zio_level = 0x0
zio_blkid = 0xb59
bad_ranges = 0x560 0x580
bad_ranges_min_gap = 0x8
bad_range_sets = 0x4d
bad_range_clears = 0x3b
bad_set_bits = 0x92 0x10 0xc 0x25 0x81 0x80 0xaf 0x32 0xe6 0x84 0x21 0x8c 0x23 0x80 0xb9 0x50 0xc0 0x1 0x2 0x0 0xa0 0x3 0x8c 0xcb 0x18 0xc0 0x0 0x3 0x1c 0xb3 0x80 0x21
bad_cleared_bits = 0x0 0xc0 0x11 0x90 0x46 0x20 0x50 0x0 0x0 0x0 0xc 0x61 0x4 0xf 0x0 0xa4 0x5 0x72 0x95 0x19 0x44 0x80 0x60 0x0 0x1 0x9 0x28 0x10 0xa2 0xc 0x4c 0xa
time = 0x66c06ce3 0x1d27344e
eid = 0x17
Aug 17 2024 11:26:59.489108558 ereport.fs.zfs.checksum
class = "ereport.fs.zfs.checksum"
ena = 0xa8643ce597102401
detector = (embedded nvlist)
version = 0x0
scheme = "zfs"
pool = 0x46e14f483fbb48c0
vdev = 0xba078544e9c69f80
(end detector)
pool = "Data"
pool_guid = 0x46e14f483fbb48c0
pool_state = 0x0
pool_context = 0x0
pool_failmode = "wait"
vdev_guid = 0xba078544e9c69f80
vdev_type = "disk"
vdev_path = "/dev/disk/by-id/ata-ST18000NM003D-3DL103_ZVTAEL6D-part1"
vdev_devid = "ata-ST18000NM003D-3DL103_ZVTAEL6D-part1"
vdev_ashift = 0x9
vdev_complete_ts = 0x8a8643add8d6
vdev_delta_ts = 0x124079
vdev_read_errors = 0x0
vdev_write_errors = 0x0
vdev_cksum_errors = 0x4
vdev_delays = 0x0
parent_guid = 0xf8dd7a25055d838
parent_type = "raidz"
vdev_spare_paths =
vdev_spare_guids =
zio_err = 0x0
zio_flags = 0x1000b0
zio_stage = 0x400000
zio_pipeline = 0x3e00000
zio_delay = 0x0
zio_timestamp = 0x0
zio_delta = 0x0
zio_priority = 0x4
zio_offset = 0xbae417ed000
zio_size = 0x2b000
zio_objset = 0x283
zio_object = 0x11cdb
zio_level = 0x0
zio_blkid = 0xb5a
bad_ranges = 0x14560 0x14580
bad_ranges_min_gap = 0x8
bad_range_sets = 0x4b
bad_range_clears = 0x39
bad_set_bits = 0x14 0xa0 0xe1 0x1 0x30 0x82 0x93 0x49 0xc0 0x20 0x34 0x82 0x41 0x34 0xd0 0x20 0x6d 0x69 0x40 0xa0 0x91 0x44 0x9 0x20 0x9b 0xa0 0x1 0x90 0x58 0x85 0x2 0x40
bad_cleared_bits = 0x1 0x14 0x12 0x8c 0x0 0x4 0x4 0x4 0x2c 0x89 0xc1 0x0 0x82 0x0 0x20 0x44 0x0 0x86 0x2 0x1c 0x42 0x0 0x90 0x13 0x24 0x5a 0x18 0x22 0x7 0x2 0x8 0x26
time = 0x66c06ce3 0x1d27344e
eid = 0x18
If you need more of the log, I'll upload it.
I'll see if I can get back to kernel 6.4, or if this will break my Arc GPU support. The microcode update should have already happened; I'll check.
For what's involved between the disks and the system:
8 Seagate X20 drives going directly to the 8 onboard SATA ports on the ASRock Phantom Gaming Pro 4; tried different cables, all new ones.
For power: currently one cable from the PSU to 5 disks and another cable to the others with a SATA power splitter cable. The previous PSU had enough SATA connectors to not need a splitter, but also errored.
Anything I forgot?
PS
> I would also remark that, from memory, AMD doesn't explicitly promise all the ECC functionality works on consumer boards, that's more of a "we don't stop board vendors from enabling it after testing it, good luck", so it's not impossible that there's still ECC errors that aren't being reported properly, though I wouldn't put that as the most likely answer without more data.
Oh nice to know. I'll prob. get server HW on the next iteration.
Multiple scrubs - the error count keeps rising.
Will test the microcode update and an older kernel next.
Did you try the noxsaves kernel parameter? That'd be my suggestion as most likely to help.
> Did you try the noxsaves kernel parameter? That'd be my suggestion as most likely to help.
I did just now and after an hour of scrub the first error appeared (I did a zpool clear before). Should I do a second scrub after to see if they keep appearing?
I'd let the first one finish and then run a second one, to be sure - since the problem there is "it screwed up a calculation sometimes", if it is that problem, it's possible these are actual incorrect things written out, the first time.
The second pass is the more interesting question.
The second one also errors.
Fun thing: now on every disk, for the first time.
ZFS verifies checksums per block. If the block is not trivially small, it may be difficult to say where the error actually happened.
PS: Rather than running scrubs in a loop collecting errors, I would run a good, long memory test.
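A userspace option, if rebooting into memtest86+ is inconvenient (size and pass count below are just placeholders; leave headroom for the OS and ARC):
apt install memtester
memtester 8G 3    # test 8 GiB for 3 passes; scale to what's actually free on the box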
Did a 10h memory test with the old RAM. The new RAM (unbuffered ECC) is new.
Did a BIOS update, and with it a microcode update - no changes.
Still working on getting my setup to run on a 6.1 or 6.3 kernel.
Should I still change the algorithm ZFS uses to calculate checksums? Would I have to rewrite the data on the disks?
> Should I still change the algorithm ZFS uses to calculate checksums?
Yes, that would be great to rule out any FPU/SIMD-related problems. Assuming you're using fletcher4 checksums and no encryption,
echo scalar > /sys/module/zfs/parameters/zfs_fletcher_4_impl
echo scalar > /sys/module/zfs/parameters/zfs_vdev_raidz_impl
should do the trick.
> Would I have to rewrite the data on the disks?
No, by doing the above you're just changing the implementation of the algorithm used. The algorithm itself is still defined by the dataset's checksum property.
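If you want to double-check which algorithm the datasets are actually set to, something like this should show it:
zfs get -r checksum Data    # "on" means the default, which is fletcher4 for regular data blocks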
> (unbuf ECC) is new
New memory can have bit errors - it happened to me once, and your symptoms sound familiar. Don't assume, verify with proof: memtest the new RAM.
> Should I still change the algorithm ZFS uses to calculate checksums?
>
> Yes, that would be great to rule out any FPU/SIMD-related problems. Assuming you're using fletcher4 checksums and no encryption,
>
> echo scalar > /sys/module/zfs/parameters/zfs_fletcher_4_impl
> echo scalar > /sys/module/zfs/parameters/zfs_vdev_raidz_impl
>
> should do the trick.
>
> Would I have to rewrite the data on the disks?
>
> No, by doing the above you're just changing the implementation of the algorithm used. The algorithm itself is still defined by the dataset's checksum property.
THAT did it! Two scrubs in a row with zero errors. Is this persistent?
Also, where could the underlying error be? I doubt the HW a bit, as it was running under TrueNAS before without errors, and also with vfio.
ZFS version? Kernel version? Both?
It'll reset on reboot, but if changing that helped you, you have SIMD issues.
I'd still suspect an interaction with that Zen 2 erratum, absent more data, since there's not been a flood of 500+ people complaining about this. I'll go try to work up a patch that'd provide more useful information and hooks for debugging if I get a chance.
Here, this patch will log something like:
[ 232.980881] SIMD debug: sse: 1 sse2: 1 sse3: 1 ssse3: 1 sse41: 1 sse42: 1 avx: 1 avx2: 1 bmi1: 1 bmi2: 1 aes: 1 pclmulqdq: 1 movbe: 1 shani: 1 avx512f: 0 avx512cd: 0 avx512er: 0 avx512pf: 0 avx512bw: 0 avx512dq: 0 avx512vl: 0 avx512ifma: 0 avx512vbmi: 0 neon: 0 sha256: 0 sha512: 0 altivec: 0 vsx: 0 isa207: 0 xsaves: 1 xsaveopt: 1 xsave: 1 fxsr: 1
on module load.
https://github.com/rincebrain/zfs/commit/07cd079f7a6ea4b5f3526ab8ebf04a786245a53a.patch
I've been sketching something like this so that people don't have to apply debugging patches to get this information out live, but haven't gotten back to it in a bit.
The key interesting point, to me, would be if you could build with this patch, then try loading it once with and once without noxsaves set as a kernel parameter, because my suspicion would be that something about that isn't applying to our check.
But that's just a guess.
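Rough sketch of pulling the patch onto a 2.2.5 source tree and rebuilding (tag name and steps are assumptions, and the patch may need minor adjusting to apply cleanly):
git clone https://github.com/openzfs/zfs.git && cd zfs
git checkout zfs-2.2.5
curl -L https://github.com/rincebrain/zfs/commit/07cd079f7a6ea4b5f3526ab8ebf04a786245a53a.patch | git am
sh autogen.sh && ./configure && make -j"$(nproc)"
# then install and reload the modules, and compare the dmesg line with and without noxsaves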
without noxsaves:
[ 7.039568] SIMD debug: sse: 1 sse2: 1 sse3: 1 ssse3: 1 sse41: 1 sse42: 1 avx: 1 avx2: 1 bmi1: 1 bmi2: 1 aes: 1 pclmulqdq: 1 movbe: 1 shani: 1 avx512f: 0 avx512cd: 0 avx512er: 0 avx512pf: 0 avx512bw: 0 avx512dq: 0 avx512vl: 0 avx512ifma: 0 avx512vbmi: 0 neon: 0 sha256: 0 sha512: 0 altivec: 0 vsx: 0 isa207: 0 xsaves: 0 xsaveopt: 1 xsave: 1 fxsr: 1
with noxsaves on the cmdline:
[ 7.418467] SIMD debug: sse: 1 sse2: 1 sse3: 1 ssse3: 1 sse41: 1 sse42: 1 avx: 1 avx2: 1 bmi1: 1 bmi2: 1 aes: 1 pclmulqdq: 1 movbe: 1 shani: 1 avx512f: 0 avx512cd: 0 avx512er: 0 avx512pf: 0 avx512bw: 0 avx512dq: 0 avx512vl: 0 avx512ifma: 0 avx512vbmi: 0 neon: 0 sha256: 0 sha512: 0 altivec: 0 vsx: 0 isa207: 0 xsaves: 0 xsaveopt: 1 xsave: 1 fxsr: 1
(noxsaves on both the Proxmox and VM cmdlines)
Cute, so it was already set to 0, I would guess because the kernel detected the same sort of concern that I mentioned.
That makes this all the more fascinating, I suppose. Because that erratum is explicitly only supposed to be if we use xsaves, and that should never happen with these settings.
I believe the term of art here is "uh-oh."
If random things aren't crashing with the scalar setting (try openssl speed in a loop for a bit and check that, though), then it's likely OpenZFS that is somehow interacting with something such that save/restore is turning out surprisingly - but I'm not immediately sure how, since that setting very explicitly will stop us from trying xsaves, and that erratum explicitly claims it's only that instruction.
...of course, something something theory and practice.
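A crude version of that openssl loop, for what it's worth:
# hammer the AES-NI/SIMD code paths for a while; a crash or wrong result here points at the CPU rather than ZFS
while true; do openssl speed -evp aes-256-gcm || break; done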
If I were guessing here, also, Intel had a bug relatively recently where SIMD state didn't get saved/restored properly across VMENTER/VMEXIT.
So stopping all VMs on the system then running a couple scrubs with the tunables set to non-scalar values (e.g. the defaults) might also be informative.
(I tried, but couldn't reproduce this on my Zen 3 system with the same kernel and ZFS version, even with VMs running, so I don't think it's a quirk of Linux 6.10 or something...)
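Roughly, for that stop-the-VMs-and-scrub pass (assuming Proxmox's qm tool and that the tunables are back at their defaults):
for id in $(qm list | awk 'NR>1 {print $1}'); do qm shutdown "$id"; done
zpool clear Data
zpool scrub Data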
Well, as long as it's running on bare metal, ZFS' own calculations shouldn't be impacted by any XSAVE bug. SIMD usage in ZFS does not depend on SIMD state. All SIMD usage is along the lines of:
- disable interrupts and preemption
- save SIMD state
- do the calculations in one go
- restore SIMD state
- enable interrupts and preemption
So any bug in the save/restore procedure would "just" clobber other people's registers, not our own.
If running virtualized, ZFS of course depends on the HV presenting a consistent SIMD state.
@KoffeinKaio It would be interesting to see if you can run mprime -t for a couple of hours on bare metal, ideally with the zfs modules unloaded or at least with the fletcher and raidz implementations set to scalar. mprime -t is really good at stress testing SIMD operations and will exit on the first error it encounters. Be warned though, it will peg your CPU at 100%. This way we could exclude the possibility of a flaky SIMD unit.
If it passes on bare metal you could repeat the test on a VM as well. This would ensure that the HV is handling SIMD state correctly.
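For reference, once you have the Prime95/mprime binary from mersenne.org, the torture test is just:
./mprime -t                                      # pegs every core; run it when the box is otherwise idle
tail -f results.txt | grep -E 'ERROR|failure'    # in another terminal, watch for reported failures as it runs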
@KoffeinKaio BTW, you can make the fletcher/raidz settings permanent by creating a file in /etc/modprobe.d and adding appropriate entries to it. Off the top of my head that would be
options zfs zfs_fletcher_4_impl=scalar
options zfs zfs_vdev_raidz_impl=scalar
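For example, a sketch of persisting those (the file name is arbitrary; adjust the module name if the tunable lives under zcommon on your build):
cat > /etc/modprobe.d/zfs-scalar.conf <<'EOF'
options zfs zfs_fletcher_4_impl=scalar
options zfs zfs_vdev_raidz_impl=scalar
EOF
update-initramfs -u    # only needed if the zfs module loads from the initramfs (e.g. root on ZFS)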
Will see if I can find time to run mprime - unloading ZFS might not be possible unless I boot some rescue system.
I can, but only if really necessary. Will try on the HV without unloading first.
Running mprime -t with the fletcher/raidz implementations set to scalar should be fine. Let's see how it goes.
Running without problems for like 10 to 11 hrs.
Just double-checked, and it seems my memory served me wrong: mprime will not exit on failures. Can you please inspect the results.txt file located in the directory you were in when you started mprime? grep -E 'ERROR|failure' results.txt shouldn't turn up anything.
> shouldn't turn up anything.
well...
FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected running 16K FFT size, consult stress.txt file.
FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected running 16K FFT size, consult stress.txt file.
FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected running 16K FFT size, consult stress.txt file.
FATAL ERROR: Final result was 9BC4E89E, expected: 62266123.
Hardware failure detected running 16K FFT size, consult stress.txt file.
That's unfortunate. The mprime failure is a strong indication of a hardware issue. The next step would be to narrow down which part of the hardware misbehaves. This isn't an easy task though.
I'd love it if you have some ideas on how to do that. Otherwise I'll test the default stuff the internet says - give the CPU a tiny bit more voltage, test the RAM, or, in the end, buy a new CPU (if it even is only the CPU?).
"Rounding error" almost certainly means the CPU, yes, and unless you changed the power from stock before, I would be surprised if tweaking the voltages reliably improved your outcomes.
For shits and giggles, what do the "microcode" lines in /proc/cpuinfo say?
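That is, something like:
grep microcode /proc/cpuinfo | sort | uniq -c    # the revision should be identical on every core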