
Frequent Checksum Errors During Scrub on ZFS Pool

Open KoffeinKaio opened this issue 1 year ago • 49 comments

System information

Type Version/Name
Distribution Name Debian
Distribution Version 12
Kernel Version 6.10.3-amd64
Architecture amd64
OpenZFS Version zfs-2.2.5-1

Describe the problem you're observing

Checksum errors on scrub - in rare cases none, in most cases 3-4 on different drives, and sometimes not even the same drives each time. No SMART errors logged.

Changed:
- RAM to ECC RAM
- new PSU
- connected 3 drives of the pool to another PCIe SATA card
- changed all cables to the drives

The problem was originally a Q&A discussion - but since I'm out of options on what could cause it, maybe it's a bug somewhere? https://github.com/openzfs/zfs/discussions/16445

Describe how to reproduce the problem

run a scrub. zpool scrub Data

Include any warning/errors/backtraces from the system logs

root@pve:~# zpool status -v
  pool: Data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Thu Aug 15 17:15:59 2024
	35.6T / 84.1T scanned at 14.4G/s, 3.92T / 84.1T issued at 1.58G/s
	516K repaired, 4.66% done, 14:24:32 to go
config:

	NAME                                   STATE     READ WRITE CKSUM
	Data                                   ONLINE       0     0     0
	  raidz2-0                             ONLINE       0     0     0
	    ata-ST18000NM003D-3DL103_ZVT4WHGT  ONLINE       0     0     0
	    ata-ST18000NM003D-3DL103_ZVT5NTBG  ONLINE       0     0     0
	    ata-ST18000NM003D-3DL103_ZVT5MER8  ONLINE       0     0     0
	    ata-ST18000NM003D-3DL103_ZVT0SCR9  ONLINE       0     0     1  (repairing)
	    ata-ST18000NM003D-3DL103_ZVT07MLA  ONLINE       0     0     0
	    ata-ST18000NM003D-3DL103_ZVT9JNZR  ONLINE       0     0     0
	    ata-ST18000NM003D-3DL103_ZVT9C7NM  ONLINE       0     0     1  (repairing)
	    ata-ST18000NM003D-3DL103_ZVTAEL6D  ONLINE       0     0     1  (repairing)

I don't want to blindly swap out the CPU or board, or buy another HBA controller - any ideas?

KoffeinKaio avatar Aug 15 '24 16:08 KoffeinKaio

You said in the discussion you removed vfio and PCIe passthrough and still got checksum errors - that's not super surprising, depending on the origin. The question, I think, is whether you still get more on a second scrub on the HV - since, if we hypothesize these are errors on-disk, then yeah, you're going to find them on a scrub after disconnecting; the real question is whether you find more later.

I would also remark that, from memory, AMD doesn't explicitly promise all the ECC functionality works on consumer boards, that's more of a "we don't stop board vendors from enabling it after testing it, good luck", so it's not impossible that there's still ECC errors that aren't being reported properly, though I wouldn't put that as the most likely answer without more data.

My first question would be what the errors look like - e.g. is it always reporting 0s for the checksums or something - which the events in zpool events -v should show. (You might need a debug build for it to report that, I never remember.)

My first suggestion would be, don't run 6.10 or 6.9 - even 2.2.5 doesn't claim 6.10 support, so let's back off to something earlier that's in the actual supported range and see if this keeps happening. I'd suggest pre-6.4, but just trying to pick something that people have been using for a while without screams.

My second question would be, does this go away if you change the parity or checksum computation to use a different implementation? (e.g. given:

# grep . /sys/module/{icp,zcommon,zfs}/parameters/{zfs_fletcher_4_impl,icp_aes_impl,icp_gcm_impl,zfs_vdev_raidz_impl} 2>/dev/null
/sys/module/icp/parameters/icp_aes_impl:cycle [fastest] generic x86_64 aesni
/sys/module/icp/parameters/icp_gcm_impl:cycle [fastest] avx generic pclmulqdq
/sys/module/zcommon/parameters/zfs_fletcher_4_impl:[fastest] scalar superscalar superscalar4 sse2 ssse3 avx2
/sys/module/zfs/parameters/zfs_vdev_raidz_impl:cycle [fastest] original scalar sse2 ssse3 avx2

picking something other than "fastest" for zfs_fletcher_4_impl, like superscalar4, which does not use any SIMD instructions, and seeing if the problem stops occurring.

I should note, by nature of doing this, it's going to be substantially slower at checksums, so your scrubs may take longer and much more CPU time, but, if we want to eliminate weird bugs like that, ...)

Intel had a fun bug in Sapphire Rapids chips recently where virtualization played badly with SIMD instructions, so I wouldn't be totally astonished if AMD turned up something strange in the same vein, though I don't know of one offhand.

My third question would be, what exactly is every physical connection step between the disks and any SATA/SAS/USB controllers involved? Because the last time I personally had a weird problem with checksum errors cropping up weirdly and going up over time, it was because something in my controller setup was masking a bunch of write errors, so they only showed up as errors when I tried to actually read them back later, and, naturally, they were not what I had expected. (Moving the disks to a different machine and controller temporarily made this very clear.) That was all a USB->SATA setup, so I don't think it's the specific case here, but just to point out, getting checksum errors back can be what happens if you failed to write something, somehow.

My naive guess, since it's not reporting any uncorrectable errors at the moment, is that somehow, occasionally a write isn't getting to disk on one or two disks, and isn't properly bubbling up an error somehow, but since it's P=2, you've got enough redundancy to recover from it when you notice it later.

rincebrain avatar Aug 16 '24 20:08 rincebrain

So, if it were me, I'd make sure I had backups of everything on there (not because of some urgent risk I haven't mentioned, just on principle before experimenting), run two or three scrubs without changing anything else outside of the VM and see if the checksum error count keeps going up each time (the first time wouldn't be surprising, it's the second and third time that are more interesting).

Capture the errors from zpool events -v that crop up when it errors and share them (if there are any filenames in them, feel free to delete them; they shouldn't matter). (Note that zpool events -v doesn't persist across reboots, so capture it before rebooting.)

If they didn't keep going up, we dig more into why the virtualization is making things odd; if they did, then we dig into why that, probably by me suggesting you do superscalar4 on fletcher_4_impl and doing a few scrubs again, which will likely take a bit.

(And if you have an older kernel than 6.9 to conveniently try, I'd try it after 2 scrubs outside of the VM but before changing any other settings - and then do a few scrubs again in that and see if it turns out differently.)

rincebrain avatar Aug 16 '24 20:08 rincebrain

Oh right, I thought I vaguely remembered one.

And it's for Zen 2, isn't that fun. #14557

So maybe a newer microcode would be a good time for you. (You could also test by booting with noxsaves and changing nothing else, and seeing if the problem goes away...)
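(For reference: on a Debian/Proxmox host booting via GRUB, a sketch of one way to set that - assuming the stock /etc/default/grub layout - is to add the parameter to the default kernel command line, then run update-grub and reboot:

```
GRUB_CMDLINE_LINUX_DEFAULT="quiet noxsaves"
```

"quiet" is just the Debian default here; keep whatever is already on that line and append noxsaves.)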

rincebrain avatar Aug 16 '24 22:08 rincebrain

thank you so much for this wall of help!

I'll first do a few scrubs and see if the errors increase (still on the HV itself).

zpool events -v: (snip)

Aug 17 2024 11:26:59.489108558 ereport.fs.zfs.checksum
        class = "ereport.fs.zfs.checksum"
        ena = 0xa8643cdeea300401
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0x46e14f483fbb48c0
                vdev = 0x9c76b56715dbd3b
        (end detector)
        pool = "Data"
        pool_guid = 0x46e14f483fbb48c0
        pool_state = 0x0
        pool_context = 0x0
        pool_failmode = "wait"
        vdev_guid = 0x9c76b56715dbd3b
        vdev_type = "disk"
        vdev_path = "/dev/disk/by-id/ata-ST18000NM003D-3DL103_ZVT0SCR9-part1"
        vdev_devid = "ata-ST18000NM003D-3DL103_ZVT0SCR9-part1"
        vdev_ashift = 0x9
        vdev_complete_ts = 0x8a8643ae5e04
        vdev_delta_ts = 0x12b736
        vdev_read_errors = 0x0
        vdev_write_errors = 0x0
        vdev_cksum_errors = 0x2
        vdev_delays = 0x0
        parent_guid = 0xf8dd7a25055d838
        parent_type = "raidz"
        vdev_spare_paths = 
        vdev_spare_guids = 
        zio_err = 0x0
        zio_flags = 0x1000b0
        zio_stage = 0x400000
        zio_pipeline = 0x3e00000
        zio_delay = 0x0
        zio_timestamp = 0x0
        zio_delta = 0x0
        zio_priority = 0x4
        zio_offset = 0xbae41798000
        zio_size = 0x2b000
        zio_objset = 0x283
        zio_object = 0x11cdb
        zio_level = 0x0
        zio_blkid = 0xb59
        bad_ranges = 0x560 0x580 
        bad_ranges_min_gap = 0x8
        bad_range_sets = 0x4d 
        bad_range_clears = 0x3b 
        bad_set_bits = 0x92 0x10 0xc 0x25 0x81 0x80 0xaf 0x32 0xe6 0x84 0x21 0x8c 0x23 0x80 0xb9 0x50 0xc0 0x1 0x2 0x0 0xa0 0x3 0x8c 0xcb 0x18 0xc0 0x0 0x3 0x1c 0xb3 0x80 0x21 
        bad_cleared_bits = 0x0 0xc0 0x11 0x90 0x46 0x20 0x50 0x0 0x0 0x0 0xc 0x61 0x4 0xf 0x0 0xa4 0x5 0x72 0x95 0x19 0x44 0x80 0x60 0x0 0x1 0x9 0x28 0x10 0xa2 0xc 0x4c 0xa 
        time = 0x66c06ce3 0x1d27344e 
        eid = 0x17

Aug 17 2024 11:26:59.489108558 ereport.fs.zfs.checksum
        class = "ereport.fs.zfs.checksum"
        ena = 0xa8643ce597102401
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0x46e14f483fbb48c0
                vdev = 0xba078544e9c69f80
        (end detector)
        pool = "Data"
        pool_guid = 0x46e14f483fbb48c0
        pool_state = 0x0
        pool_context = 0x0
        pool_failmode = "wait"
        vdev_guid = 0xba078544e9c69f80
        vdev_type = "disk"
        vdev_path = "/dev/disk/by-id/ata-ST18000NM003D-3DL103_ZVTAEL6D-part1"
        vdev_devid = "ata-ST18000NM003D-3DL103_ZVTAEL6D-part1"
        vdev_ashift = 0x9
        vdev_complete_ts = 0x8a8643add8d6
        vdev_delta_ts = 0x124079
        vdev_read_errors = 0x0
        vdev_write_errors = 0x0
        vdev_cksum_errors = 0x4
        vdev_delays = 0x0
        parent_guid = 0xf8dd7a25055d838
        parent_type = "raidz"
        vdev_spare_paths = 
        vdev_spare_guids = 
        zio_err = 0x0
        zio_flags = 0x1000b0
        zio_stage = 0x400000
        zio_pipeline = 0x3e00000
        zio_delay = 0x0
        zio_timestamp = 0x0
        zio_delta = 0x0
        zio_priority = 0x4
        zio_offset = 0xbae417ed000
        zio_size = 0x2b000
        zio_objset = 0x283
        zio_object = 0x11cdb
        zio_level = 0x0
        zio_blkid = 0xb5a
        bad_ranges = 0x14560 0x14580 
        bad_ranges_min_gap = 0x8
        bad_range_sets = 0x4b 
        bad_range_clears = 0x39 
        bad_set_bits = 0x14 0xa0 0xe1 0x1 0x30 0x82 0x93 0x49 0xc0 0x20 0x34 0x82 0x41 0x34 0xd0 0x20 0x6d 0x69 0x40 0xa0 0x91 0x44 0x9 0x20 0x9b 0xa0 0x1 0x90 0x58 0x85 0x2 0x40 
        bad_cleared_bits = 0x1 0x14 0x12 0x8c 0x0 0x4 0x4 0x4 0x2c 0x89 0xc1 0x0 0x82 0x0 0x20 0x44 0x0 0x86 0x2 0x1c 0x42 0x0 0x90 0x13 0x24 0x5a 0x18 0x22 0x7 0x2 0x8 0x26 
        time = 0x66c06ce3 0x1d27344e 
        eid = 0x18

If you need more of the log, I'll upload it.
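(As an aside on decoding these events - this is my reading of the fields, not authoritative: bad_ranges gives [start, end) byte offsets within the block that mismatched, and bad_set_bits / bad_cleared_bits are per-byte masks of the bits that differ from the reconstructed data. A small Python sketch cross-checks the first event above: the popcounts of the masks should sum to bad_range_sets (0x4d) and bad_range_clears (0x3b):

```python
# Cross-check the bitmask fields of the first checksum event above.
# bad_set_bits: per-byte masks of bits set in the bad copy but clear in the
# reconstructed good data; bad_cleared_bits is the opposite direction.
set_bits = [0x92, 0x10, 0xc, 0x25, 0x81, 0x80, 0xaf, 0x32,
            0xe6, 0x84, 0x21, 0x8c, 0x23, 0x80, 0xb9, 0x50,
            0xc0, 0x1, 0x2, 0x0, 0xa0, 0x3, 0x8c, 0xcb,
            0x18, 0xc0, 0x0, 0x3, 0x1c, 0xb3, 0x80, 0x21]
cleared_bits = [0x0, 0xc0, 0x11, 0x90, 0x46, 0x20, 0x50, 0x0,
                0x0, 0x0, 0xc, 0x61, 0x4, 0xf, 0x0, 0xa4,
                0x5, 0x72, 0x95, 0x19, 0x44, 0x80, 0x60, 0x0,
                0x1, 0x9, 0x28, 0x10, 0xa2, 0xc, 0x4c, 0xa]

bad_range = (0x560, 0x580)  # [start, end) byte offsets of the mismatch
# One mask byte per byte in the bad range:
assert bad_range[1] - bad_range[0] == len(set_bits) == len(cleared_bits)

n_set = sum(bin(b).count("1") for b in set_bits)
n_cleared = sum(bin(b).count("1") for b in cleared_bits)
# These totals match the event's bad_range_sets / bad_range_clears fields:
assert n_set == 0x4d and n_cleared == 0x3b
```

So in this event, 77 bits flipped on and 59 flipped off within a single 32-byte stretch - a dense scattering of bit flips rather than, say, a whole sector of zeros.)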

I'll see if I can get back to kernel 6.4, or whether that would break my Arc GPU support. The microcode update should have already happened; I'll check.

For what's involved between the disks and the system:

8 Seagate X20s going directly to the 8 onboard SATA ports on the ASRock Phantom Gaming Pro 4; tried different cables, all new ones.

For power: currently one cable from the PSU to 5 disks and another cable to the others via a SATA power splitter. The previous PSU had enough SATA connectors to not need a splitter, but also produced errors.

Anything I forgot?

PS

I would also remark that, from memory, AMD doesn't explicitly promise all the ECC functionality works on consumer boards, that's more of a "we don't stop board vendors from enabling it after testing it, good luck", so it's not impossible that there's still ECC errors that aren't being reported properly, though I wouldn't put that as the most likely answer without more data.

Oh, good to know. I'll probably get server hardware on the next iteration.

KoffeinKaio avatar Aug 17 '24 12:08 KoffeinKaio

Multiple scrubs - the error count keeps rising.

Will test the microcode update and an older kernel next.

KoffeinKaio avatar Aug 18 '24 14:08 KoffeinKaio

Did you try the noxsaves kernel parameter? That'd be my suggestion as most likely to help.

rincebrain avatar Aug 18 '24 19:08 rincebrain

Did you try the noxsaves kernel parameter? That'd be my suggestion as most likely to help.

I did just now, and after an hour of scrubbing the first error appeared (I did a zpool clear beforehand). Should I run a second scrub afterwards to see if they keep appearing?

KoffeinKaio avatar Aug 19 '24 08:08 KoffeinKaio

I'd let the first one finish and then run a second one, to be sure - since the problem there is "it screwed up a calculation sometimes", if it is that problem, it's possible these are actually incorrect data written out the first time.

The second pass is the more interesting question.

rincebrain avatar Aug 19 '24 09:08 rincebrain

The second one also errors.

Fun thing: now on every disk, for the first time.

KoffeinKaio avatar Aug 19 '24 13:08 KoffeinKaio

ZFS verifies checksums per block. If the block is not trivially small, it may be difficult to say where the error actually happened.

PS: Rather than running scrubs in a loop collecting errors, I would run a good, long memory test.

amotin avatar Aug 19 '24 13:08 amotin

Did a 10-hour memory test with the old RAM. The new RAM (unbuffered ECC) is brand new.

KoffeinKaio avatar Aug 19 '24 15:08 KoffeinKaio

Did a BIOS update, and with it a microcode update - no changes.

Still working on getting my setup to run on a 6.1 or 6.3 kernel.

Should I still change the algorithm ZFS uses to calculate checksums? Would I have to rewrite the data on the disks?

KoffeinKaio avatar Aug 22 '24 10:08 KoffeinKaio

Should I still change the algorithm ZFS uses to calculate checksums?

Yes, that would be great, to rule out any FPU/SIMD-related problems. Assuming you're using fletcher4 checksums and no encryption,

echo scalar > /sys/module/zfs/parameters/zfs_fletcher_4_impl
echo scalar > /sys/module/zfs/parameters/zfs_vdev_raidz_impl

should do the trick.

Would I have to rewrite the data on the disks?

No - by doing the above you're just changing the implementation of the algorithm. The algorithm itself is still defined by the dataset's checksum property.
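To illustrate the implementation-vs-algorithm distinction, here is a toy Python sketch based on my reading of the Fletcher-4 definition (32-bit little-endian words fed into four 64-bit wrapping accumulators) - it is not the kernel code, just a demonstration that two differently structured implementations must produce identical sums:

```python
from itertools import accumulate
import struct

M = 2 ** 64  # accumulators are 64-bit; overflow wraps


def fletcher4_scalar(data: bytes):
    # Reference: one 32-bit little-endian word at a time.
    a = b = c = d = 0
    for (w,) in struct.iter_unpack("<I", data):
        a = (a + w) % M
        b = (b + a) % M
        c = (c + b) % M
        d = (d + c) % M
    return a, b, c, d


def fletcher4_prefix_sums(data: bytes):
    # Same checksum computed as nested running (prefix) sums - a stand-in
    # for an alternative implementation strategy; the result is identical.
    words = [w for (w,) in struct.iter_unpack("<I", data)]
    if not words:
        return 0, 0, 0, 0
    wrap = lambda x, y: (x + y) % M
    A = list(accumulate(words, wrap))   # running value of accumulator a
    B = list(accumulate(A, wrap))       # ... of b, and so on
    C = list(accumulate(B, wrap))
    D = list(accumulate(C, wrap))
    return A[-1], B[-1], C[-1], D[-1]


data = bytes(range(256)) * 64  # any buffer with a 4-byte-multiple length
assert fletcher4_scalar(data) == fletcher4_prefix_sums(data)
```

The tunables select *which* of the kernel's implementations computes these sums; the on-disk checksums are unaffected, so no data needs rewriting.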

AttilaFueloep avatar Aug 23 '24 11:08 AttilaFueloep

(unbuffered ECC) is brand new

New memory can have bit errors - it happened to me once, and your symptoms sound familiar. Don't assume; verify with proof: memtest the new RAM.

MartenBE avatar Aug 26 '24 21:08 MartenBE

Should I still change the algorithm ZFS uses to calculate checksums?

Yes, that would be great, to rule out any FPU/SIMD-related problems. Assuming you're using fletcher4 checksums and no encryption,

echo scalar > /sys/module/zfs/parameters/zfs_fletcher_4_impl
echo scalar > /sys/module/zfs/parameters/zfs_vdev_raidz_impl

should do the trick.

Would I have to rewrite the data on the disks?

No - by doing the above you're just changing the implementation of the algorithm. The algorithm itself is still defined by the dataset's checksum property.

THAT did it! Two scrubs in a row with zero errors. Is this persistent?

Also, where could the underlying error be? I somewhat doubt it's the hardware, as it was running under TrueNAS before without errors, also with vfio.

ZFS version? Kernel version? Both?

KoffeinKaio avatar Aug 29 '24 08:08 KoffeinKaio

It'll reset on reboot, but if changing that helped you, you have SIMD issues.

I'd still suspect an interaction with that Zen 2 erratum, absent more data, since there's not been a flood of 500+ people complaining about this. I'll go try to work up a patch that'd provide more useful information and hooks for debugging if I get a chance.

rincebrain avatar Aug 29 '24 16:08 rincebrain

Here, this patch will log something like:

[  232.980881] SIMD debug: sse: 1 sse2: 1 sse3: 1 ssse3: 1 sse41: 1 sse42: 1 avx: 1 avx2: 1 bmi1: 1 bmi2: 1 aes: 1 pclmulqdq: 1 movbe: 1 shani: 1 avx512f: 0 avx512cd: 0 avx512er: 0 avx512pf: 0 avx512bw: 0 avx512dq: 0 avx512vl: 0 avx512ifma: 0 avx512vbmi: 0 neon: 0 sha256: 0 sha512: 0 altivec: 0 vsx: 0 isa207: 0 xsaves: 1 xsaveopt: 1 xsave: 1 fxsr: 1

on module load.

https://github.com/rincebrain/zfs/commit/07cd079f7a6ea4b5f3526ab8ebf04a786245a53a.patch

I've been sketching something like this so that people don't have to apply debugging patches to get this information out live, but haven't gotten back to it in a bit.

The key interesting point, to me, would be if you could build with this patch, then try loading it once with and once without noxsaves set as a kernel parameter, because my suspicion would be that something about that isn't applying to our check.

But that's just a guess.

rincebrain avatar Aug 29 '24 18:08 rincebrain

without noxsaves:

[    7.039568] SIMD debug: sse: 1 sse2: 1 sse3: 1 ssse3: 1 sse41: 1 sse42: 1 avx: 1 avx2: 1 bmi1: 1 bmi2: 1 aes: 1 pclmulqdq: 1 movbe: 1 shani: 1 avx512f: 0 avx512cd: 0 avx512er: 0 avx512pf: 0 avx512bw: 0 avx512dq: 0 avx512vl: 0 avx512ifma: 0 avx512vbmi: 0 neon: 0 sha256: 0 sha512: 0 altivec: 0 vsx: 0 isa207: 0 xsaves: 0 xsaveopt: 1 xsave: 1 fxsr: 1 

with noxsaves: cmdline:

[    7.418467] SIMD debug: sse: 1 sse2: 1 sse3: 1 ssse3: 1 sse41: 1 sse42: 1 avx: 1 avx2: 1 bmi1: 1 bmi2: 1 aes: 1 pclmulqdq: 1 movbe: 1 shani: 1 avx512f: 0 avx512cd: 0 avx512er: 0 avx512pf: 0 avx512bw: 0 avx512dq: 0 avx512vl: 0 avx512ifma: 0 avx512vbmi: 0 neon: 0 sha256: 0 sha512: 0 altivec: 0 vsx: 0 isa207: 0 xsaves: 0 xsaveopt: 1 xsave: 1 fxsr: 1 

(noxsaves on both the Proxmox and VM cmdlines)

KoffeinKaio avatar Aug 30 '24 23:08 KoffeinKaio

Cute, so it was already set to 0, I would guess because the kernel detected the same sort of concern that I mentioned.

That makes this all the more fascinating, I suppose. Because that erratum is explicitly only supposed to be if we use xsaves, and that should never happen with these settings.

I believe the term of art here is "uh-oh."

If random things aren't crashing with the scalar setting (try openssl speed in a loop for a bit and check that, though), then it's likely OpenZFS that is somehow interacting with something so that save/restore is turning out surprisingly - but I'm not immediately sure how, since that setting very explicitly will stop us from trying xsaves, and that erratum explicitly claims it's only that instruction.

...of course, something something theory and practice.

rincebrain avatar Aug 30 '24 23:08 rincebrain

If I were guessing here, also, Intel had a bug relatively recently where SIMD state didn't get saved/restored properly across VMENTER/VMEXIT.

So stopping all VMs on the system then running a couple scrubs with the tunables set to non-scalar values (e.g. the defaults) might also be informative.

(I tried, but couldn't reproduce this on my Zen 3 system with the same kernel and ZFS version, even with VMs running, so I don't think it's a quirk of Linux 6.10 or something...)

rincebrain avatar Aug 31 '24 03:08 rincebrain

Well, as long as it's running on bare metal, ZFS's own calculations shouldn't be impacted by any XSAVE bug. SIMD usage in ZFS does not depend on SIMD state carried across calls. All SIMD usage is along the lines of

  1. disable interrupts and preemption
  2. save SIMD state
  3. do the calculations in one go
  4. restore SIMD state
  5. enable interrupts and preemption

So any bug in the save/restore procedure would "just" clobber other people's registers, not our own.

If running virtualized, ZFS of course depends on the hypervisor presenting consistent SIMD state.

@KoffeinKaio It would be interesting to see if you can run mprime -t for a couple of hours on bare metal, ideally with the zfs modules unloaded or at least with the fletcher and raidz implementations set to scalar. mprime -t is really good at stress testing SIMD operations and will exit on the first error it encounters. Be warned though, it will peg your CPU at 100%. This way we could exclude the possibility of a flaky SIMD unit.

If it passes on bare metal you could repeat the test on a VM as well. This would ensure that the HV is handling SIMD state correctly.

AttilaFueloep avatar Sep 04 '24 15:09 AttilaFueloep

@KoffeinKaio BTW, you can make the fletcher/raidz settings permanent by creating a file in /etc/modprobe.d and adding appropriate entries to it. Off the top of my head that would be

options zfs zfs_fletcher_4_impl=scalar
options zfs zfs_vdev_raidz_impl=scalar
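A sketch of one way to apply that (the file name zfs-scalar.conf is my arbitrary choice; on Debian/Proxmox the zfs module is loaded from the initramfs, so it needs rebuilding for the options to take effect at boot):

```shell
# Persist the scalar implementation selection across reboots.
mkdir -p /etc/modprobe.d
printf '%s\n' \
    'options zfs zfs_fletcher_4_impl=scalar' \
    'options zfs zfs_vdev_raidz_impl=scalar' \
    > /etc/modprobe.d/zfs-scalar.conf

# On Debian/Proxmox the zfs module sits in the initramfs, so rebuild it:
if command -v update-initramfs >/dev/null 2>&1; then
    update-initramfs -u
fi
```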

AttilaFueloep avatar Sep 04 '24 15:09 AttilaFueloep

Will see if I can find time to run mprime - unloading ZFS might not be possible unless I boot some rescue system.

I can - but only if really necessary. Will try on the HV without unloading first.

KoffeinKaio avatar Sep 04 '24 20:09 KoffeinKaio

Running mprime -t with the fletcher/raidz implementations set to scalar should be fine. Let's see how it goes.

AttilaFueloep avatar Sep 04 '24 21:09 AttilaFueloep

Running without problems for roughly 10 to 11 hours.

KoffeinKaio avatar Sep 05 '24 08:09 KoffeinKaio

Just double-checked and it seems my memory served me wrong: mprime will not exit on failures. Can you please inspect the results.txt file located in the directory you were in when you started mprime? grep -E 'ERROR|failure' results.txt shouldn't show anything.

AttilaFueloep avatar Sep 05 '24 13:09 AttilaFueloep

shouldn't show anything.

well...

FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected running 16K FFT size, consult stress.txt file.
FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected running 16K FFT size, consult stress.txt file.
FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected running 16K FFT size, consult stress.txt file.
FATAL ERROR: Final result was 9BC4E89E, expected: 62266123.
Hardware failure detected running 16K FFT size, consult stress.txt file.

KoffeinKaio avatar Sep 05 '24 14:09 KoffeinKaio

That's unfortunate. The mprime failure is a strong indication of a hardware issue. The next step would be to narrow down which part of the hardware misbehaves. This isn't an easy task though.

AttilaFueloep avatar Sep 05 '24 15:09 AttilaFueloep

I'd love it if you have some ideas on how to do that. Otherwise I'll try the default stuff the internet suggests: give the CPU a tiny bit more voltage, test the RAM, or, in the end, buy a new CPU (if it is only the CPU?).

KoffeinKaio avatar Sep 05 '24 19:09 KoffeinKaio

"Rounding error" almost certainly means the CPU, yes, and unless you changed the power from stock before, I would be surprised if tweaking the voltages reliably improved your outcomes.

For shits and giggles, what do the "microcode" lines in /proc/cpuinfo say?
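For completeness, one way to pull those out (a sketch; x86-specific, and on a healthy system every core reports the same revision, so this should print a single line):

```shell
# Print the distinct microcode revisions reported across all cores.
awk -F': ' '/^microcode/ {print $2}' /proc/cpuinfo | sort -u
```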

rincebrain avatar Sep 05 '24 19:09 rincebrain