continuous cksum errors after drive replaced
System information

Type | Version/Name
---|---
Distribution Name | OpenSUSE
Distribution Version | Leap 15.4
Kernel Version | 5.14.21-150400.24.55-default
Architecture | x86_64, AMD Ryzen 9 5950X
OpenZFS Version | zfs-2.1.9-1, zfs-kmod-2.1.9-1
Describe the problem you're observing
I have 2 ZFS pools, "data" and "backup". Both pools have 6 drives:
"data" is 6 SSDs in raidz, and "backup" is 6 HDDs in raidz2.
One of the SSDs failed - it just stopped working completely.
I replaced it with a brand new SSD (let's call this new one Drive A).
zpool replace data wwn-
Next action:
- Took out the SATA cable that was connected to Drive A and replaced it with a new SATA cable.
- Connected the new SATA cable to the same SATA port on the motherboard, and connected the other end to an existing working drive with no errors (let's call this Drive B).
- Connected the SATA cable that had been plugged into Drive B into Drive A.

So now Drive A is on a known-good SATA cable and a known-good SATA port (Drive B had been working there for around 3 years, 0 errors), and Drive B is on a new SATA cable, plugged into the SATA port where Drive A used to be when it was getting cksum errors.

- Rebooted
- zpool scrub
- Waited for the scrub to complete
After the scrub completed, repaired data, and showed 0 errors, I immediately began seeing cksum errors on Drive A again, even though this drive is now on a known-good SATA cable and a known-good SATA port.
0 errors on Drive B.
Actually, 0 errors on 11 of the 12 drives - the original 5 drives that were part of zpool data, and the 6 drives that are part of zpool backup.
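A quick way to track which vdev is accumulating checksum errors is to pull the CKSUM column out of zpool status. A minimal sketch; the sample output below is fabricated (invented wwn ids and counts) so the filtering is self-contained, but in practice you would pipe the live command straight into the awk:

```shell
# Pull per-device CKSUM counters out of `zpool status` output.
# The sample is fabricated; live form: zpool status data | awk '...'
status_sample='  NAME                        STATE     READ WRITE CKSUM
  data                        ONLINE       0     0     0
    raidz1-0                  ONLINE       0     0     0
      wwn-0x5000000000000001  ONLINE       0     0    42
      wwn-0x5000000000000002  ONLINE       0     0     0'

# Print any leaf vdev whose CKSUM counter is non-zero.
echo "$status_sample" | awk '$1 ~ /^wwn-/ && $5 > 0 { print $1, $5 }'
```

Run periodically (or from cron), this makes it easy to confirm claims like "always the same slot" without eyeballing the full status output.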
Next action: replaced Drive A with Drive C, another spare SATA drive (out of a different system, but previously used with no errors).
- zpool replace data wwn-<DriveA> wwn-<DriveC>
- waited for the resilver to complete; no errors
- deleted all snapshots on zpool data
- cksum errors returned, on Drive C
- zpool clear
- zpool scrub
- waited for the scrub to complete; bad blocks repaired, 0 errors
- ran zpool scrub a second time; no errors, no bad blocks
- cksum errors returned, on Drive C
Replaced Drive C with Drive D (another brand-new SSD).
- zpool replace data wwn-<DriveC> wwn-<DriveD>
- waited for the resilver to complete; no errors
- cksum errors on Drive D
- zpool clear data
- zpool scrub data
- scrub repaired bad blocks and completed with no errors
- cksum errors returned on Drive D
Shut down, booted memtest86, ran for a few hours, multiple passes, no errors
Hard to believe this is 3 bad SSDs: 1 previously used drive with no previous errors, and 2 brand-new SSDs. Again:
- zpool clear data
- zpool scrub data; completed, repaired bad blocks, no errors
- cksum errors returned on Drive C
During this whole time, the only errors are always on the drive replaced in zpool data.
zpool backup - lots of read/write activity, no errors whatsoever.
zpool data gets cksum errors no matter what I've done so far since the original SSD failure
Next steps I intend to take:
- zpool destroy data
- delete the partitions on the 6 SSD drives that make up zpool data
- zpool create data raidz with the 6 SSDs
- restore files to zpool data
- monitor for cksum errors
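Those steps can be sketched as a script. The wwn-* device names are hypothetical placeholders, and every command is only echoed here, not executed, because the sequence is destructive and should be reviewed first:

```shell
# Sketch of the destroy-and-recreate plan. The wwn-* ids below are
# hypothetical placeholders; substitute the real ones. Commands are
# echoed rather than run, since zpool destroy/create are destructive.
DISKS="wwn-0xAAA wwn-0xBBB wwn-0xCCC wwn-0xDDD wwn-0xEEE wwn-0xFFF"

run() { echo "WOULD RUN: $*"; }   # change the body to "$@" to execute for real

run zpool destroy data
for d in $DISKS; do
  run wipefs -a "/dev/disk/by-id/$d"   # clear old labels/partitions
done
run zpool create data raidz $DISKS
run zpool scrub data                   # verify after restoring the files
```

The echo-wrapper pattern is just a cheap dry-run guard; it keeps the ordering explicit (destroy, wipe, create, restore, scrub) while making an accidental paste harmless.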
Describe how to reproduce the problem
Following a zpool scrub, wait a little while and the cksum errors return, always on the 1st drive listed for the pool in zpool status output. This appears to me to be a potential bug in ZFS.
It's hard to believe this is a drive issue. 3 different SSDs, 2/3 brand new, always get the same type of error. 11/12 drives get no errors.
It's hard to believe it's a SATA controller issue - 8 SATA ports on the motherboard, and the errors are always on the 1st drive listed in zpool data, even when that drive is on different SATA ports and different SATA cables.
It's hard to believe this is a RAM issue. 11/12 drives have no errors. memtest86 shows no errors. If RAM is flipping a bit, hard to believe it only does that when it's writing a block to only the 1st drive in zpool data, and never when writing a block to any of the other 11 drives.
My only conclusion at this point is that it's likely a bug in ZFS, but that's really just an educated guess. I'm out of ideas as to what else it could be.
Include any warning/errors/backtraces from the system logs
Did dmesg say anything? What compiler are you using?
Other than the original drive failure, I don't believe there are any kernel messages.
ZED does show the error. This is an example of the ZED message:
zed: eid=347 class=checksum pool='data' vdev=wwn-0x
The output from zpool events -v for one of those checksum errors might be informative about whether they're random bitflips or something more complex.
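If the events do show up, a quick filter over the output can at least count the checksum-class ereports. A sketch; the sample event lines below are fabricated for illustration (real zpool events output has this general shape, a timestamp followed by the ereport class):

```shell
# Count checksum-class ereports in `zpool events`-style output.
# Sample lines are fabricated; live form: zpool events | awk '...'
events='Aug  8 2023 19:13:14.123456789 ereport.fs.zfs.checksum
Aug  8 2023 19:13:15.123456789 ereport.fs.zfs.io
Aug  8 2023 19:13:16.123456789 ereport.fs.zfs.checksum'

echo "$events" | awk '/ereport\.fs\.zfs\.checksum/ { n++ } END { print n " checksum event(s)" }'
```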
You say the SATA ports are all on the motherboard, but are they all on the same SATA controller? A lot of boards get those high port counts by adding a sort of "value add" controller from Marvell, and these controllers tend to have issues when using NCQ and being fully loaded.
The motherboard is an ASUS ROG Crosshair VIII Dark Hero, and according to the documentation the AMD X570 chipset provides 8 x SATA 6Gb/s ports. The other 4 ports come from an expansion card, but the 4 drives connected to this card are 4 of the 6 zpool backup HDDs. The lshw command shows this for the expansion card (but again, the drive in question isn't connected to this card): 88SE9230 PCIe 2.0 x2 4-port SATA 6 Gb/s RAID Controller
zpool events appears to only show the events since the last reboot, and I haven't gotten the error since the last reboot. I'll post the output of the zpool events -v after the next cksum error.
The problem has gone away (at least for now - fingers crossed.) The last change I made was to turn off dedup and run a scrub. The scrub completed with 0B repaired and no errors.
What did you have the "dedup" setting set to?
zfs set dedup=on data
That’s all I did so it would have taken whatever other defaults that are associated with dedup=on.
that would default to dedup=sha256. I don't know of any problems with that, to be clear, just as a data point.
What other settings did you have on the relevant datasets and pool? autotrim on the pool, compression settings, checksum settings, anything non-default, really.
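One way to answer that in one shot is to list only the properties whose source is "local", i.e. explicitly set, which is what "anything non-default" amounts to. A sketch: the live command would be `zfs get -s local all data`; the sample output below is fabricated so the filtering logic is self-contained:

```shell
# List only explicitly-set ("local") properties.
# Live form: zfs get -s local all data
# The sample output below is fabricated for illustration.
props_sample='NAME  PROPERTY     VALUE  SOURCE
data  compression  on     local
data  dedup        on     local
data  atime        off    default'

echo "$props_sample" | awk '$4 == "local" { print $2 "=" $3 }'
```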
Compression is turned on (zfs set compression=on data). That's the only other property that I changed. The good news is that there are still no cksum errors on the pool.
Don't take the security of your data for granted. Having checksum issues with compressed data and not having issues with non-compressed data would have been more logical (a software bug somewhere in the compression, or some more or less obscure reason). There is a red flag here; this is abnormal. A de-duplication issue might be underneath, but it might also be a hardware issue, as others suggested here before.
So as a first step: back up everything you cannot afford to lose, and double-check with checksums that you have strictly identical, non-corrupted files on "both sides".
The next step I would take is to memtest the RAM overnight (3-4 passes) and see if anything is reported. This is not an absolute truth, but if some RAM cells "jiggle" you have a good chance of spotting that. If your RAM is defective, the de-duplication data can become corrupt, and so can your pool... not cool.
PS: I do not know your precise use case, but in a general manner (having talked to iXsystems people about a storage appliance deployed at a previous workplace holding several hundred GB of research data), de-duplication should not be used except in some very specific cases. Their not needing to enable de-duplication on a big storage appliance tends to make me think it is not required for a desktop machine.
PPS: Hardware issues are very tricky sometimes; never underestimate what can happen until you have strong reasons or evidence that it is not the hardware. Did you overclock your CPU (PBO) or RAM?
EDITED: removed a technically inexact statement with regards to DDT and non-importable pools. Thank you @rincebrain for having spotted that issue in your comment below. I am not able to strike the original text so I removed it to not bring confusion.
PPPS: if the DDT is unavailable, all your data can still be read; the pool just cannot be written to any more, since that would require knowing the number of references to know whether you can free them... which I suppose one could try to synthesize by walking the metadata in its entirety.
0 errors since I turned off dedup. I've replaced another drive in /data and resilvered with 0 errors, and I've run scrub 3 times (on both /data and /backup) with no errors. zpool status shows 0 errors and "No known data errors". (scrub runs weekly on /data and monthly on /backup)
I can't prove it's not a hardware issue, but it seems extraordinarily unlikely to me. The same symptom on multiple SSDs with dedup on, 0 symptoms with dedup off on multiple drives.
memtest shows 0 errors over multiple passes over multiple hours.
All the "data" is in /data, and backed up automatically (multiple copies, multiple versions) to /backup. /backup is copied to the cloud. So, no real chance of ever losing data.
dedup was likely saving me some amount of disk space, but certainly not worth the headaches I've had with this. It's not a big deal having dedup turned off, and I'll just leave it that way. Compression is saving a great deal of disk space, but also is working fine.
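For the record, how much space dedup was actually buying can be read off the pool's dedupratio. A sketch; the live command would be `zpool list -o name,size,alloc,dedupratio data`, and the sample output below uses invented values:

```shell
# Check how much space dedup was actually saving.
# Live form: zpool list -o name,size,alloc,dedupratio data
# Sample output (values invented) parsed for the ratio:
list_sample='NAME  SIZE  ALLOC  DEDUPRATIO
data  10.9T  7.2T   1.05x'

echo "$list_sample" | awk 'NR == 2 { print "dedup ratio for " $1 ": " $4 }'
```

A ratio near 1.0x means dedup was costing DDT RAM and write overhead for almost no space savings, which supports the decision to leave it off.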
I do not overclock. This system isn't CPU constrained - it just has a lot of data (> 50 TB total between /data and /backup). It does run VMWare and several VMs so lots of data is in VMDKs. It's also on a UPS so unless I'm doing maintenance it's never powered off.
The errors are gone, and enough time and activity have passed that it seems likely I would have seen the issue by now if it were still lurking. So I'm going to close this with a big thank you to all who have responded.
I believe I've seen this issue too:
zfs-0.8.3-1ubuntu12.15
zfs-kmod-2.1.5-1ubuntu6~22.04.1
linux 5.15.0-78-generic
After zpool replace, files that I believe are deduplicated are no longer readable, and huge numbers of CKSUM errors.
I could NOT trivially reproduce the issue:
zpool create app scsi-3600224808ac74db600bb41d34cb9cc70
zfs set dedup=on app
cd /app
echo hi > file1
echo hi > file2
zpool list #2x dedup
zpool replace app scsi-3600224808ac74db600bb41d34cb9cc70 scsi-360022480cc59bedf51b7de3c14ec60c9
zpool list #2x dedup
zpool status
echo hi > file3
cat /app/file1
zpool list #3x dedup
I also experienced CKSUM errors increase after replacing a disk. It's already the second time - the first time I just replaced the whole server with a new one by using send/receive. Now I'm looking for a faster solution.
Here is a log from scrubbing a pool:
scan: scrub repaired 17.0M in 02:13:41 with 0 errors on Tue Aug 8 19:13:14 2023
scan: scrub repaired 1.54M in 02:13:20 with 0 errors on Wed Aug 9 02:06:04 2023
scan: scrub repaired 1.94M in 02:27:13 with 0 errors on Wed Aug 9 04:34:00 2023
scan: scrub repaired 16K in 02:17:11 with 0 errors on Wed Aug 9 16:19:48 2023
scan: scrub repaired 72K in 02:12:37 with 0 errors on Wed Aug 9 19:34:32 2023
scan: scrub repaired 16K in 02:12:41 with 0 errors on Wed Aug 9 23:43:39 2023
reboot
scan: scrub repaired 180K in 02:26:09 with 0 errors on Thu Aug 10 04:22:22 2023
scan: scrub repaired 40K in 02:25:27 with 0 errors on Fri Aug 11 04:23:00 2023
My config: dedup=on, compression=lz4, FreeBSD 13.2 (the previous issue experienced on FreeBSD 13.0).
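A small awk pass can summarize a scrub history like the one quoted above, e.g. how many scrubs repaired anything and how much. The sample lines are abridged copies of the log above:

```shell
# Tally the non-zero "repaired" amounts from scrub-log lines like the
# ones quoted above (these three are abridged from that log).
log='scan: scrub repaired 17.0M in 02:13:41 with 0 errors
scan: scrub repaired 1.54M in 02:13:20 with 0 errors
scan: scrub repaired 16K in 02:17:11 with 0 errors'

echo "$log" | awk '$3 == "repaired" && $4 != "0B" { n++; a = a " " $4 }
                   END { print n " scrubs repaired:" a }'
```

Repeated non-zero repair amounts across back-to-back scrubs, as in this log, are themselves a signal that corruption is being reintroduced between scrubs rather than left over from one event.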