
permanent errors (ereport.fs.zfs.authentication) reported after syncoid snapshot/send workload

Open aerusso opened this issue 3 years ago • 86 comments

System information

Type                   Version/Name
Distribution Name      Debian
Distribution Version   testing (bullseye)
Linux Kernel           5.10.19
Architecture           amd64
ZFS Version            2.0.3-1

Describe the problem you're observing

After upgrading to zfs 2.0.3 and Linux 5.10.19 (from 0.8.6), a well-tested syncoid workload causes "Permanent errors have been detected in the following files:" reports for a pool/dataset@snapshot:<0x0> (no file given).

Removing the snapshot and running a scrub causes the error to go away.

This is on a single-disk NVMe SSD that never experienced any problems before the upgrade, and it has happened twice, once after each reboot and re-run of the syncoid workload. I have since booted back into 0.8.6, run the same workload, and have not experienced the error report.

Describe how to reproduce the problem

Beyond the above, I do not have a mechanism to reproduce this. I'd rather not blindly do it again!

Include any warning/errors/backtraces from the system logs

See also @jstenback's report of very similar symptoms: 1 and 2, which appear distinct from the symptoms of the bug report they are in. Additionally, compare to @rbrewer123's reports 1 and 2, which come with a kernel panic that I do not experience.

My setup is very similar: I run a snapshot workload periodically, and transfer the snapshots every day to another machine. I also transfer snapshots much more frequently to another pool on the same machine.

If it would be valuable, I have zpool history output that I can provide. Roughly, the workload looks like many snapshot, send -I, and destroy operations (on one pool) and receive (on the same machine, but another pool).

aerusso avatar Mar 04 '21 02:03 aerusso

@behlendorf I'm trying to dig into this a little bit further. I want to rule out in-flight corruption of snapshot data, so I'd like to be able to get access to zb_blkid and zb_level in the zbookmark_phys associated with this error---and then actually read out the corrupted data. ~~Is there a better way than just putting a printk into spa_log_error (after avl_insert(tree, new, where)); that should get called exactly once per affected block, right?~~ I see that ZED will print this out. My only reproducer is my main desktop, so I want to be very careful.

EDIT: Can I just use zdb -R pool 0:$blkid to dump the contents of this block? (There's only one vdev) I.e., is the "offset" in zdb -R the same as the zb_blkid of zbookmark_phys?

Also, is this approach reasonable? I would think it would be helpful to know if the affected block is the meta_dnode, or a root block, etc., right? Or am I embarking on a wild goose chase?

aerusso avatar Mar 07 '21 16:03 aerusso

I'd like to be able to get access to zb_blkid and zb_level in the zbookmark_phys associated with this error

zpool events -v should also have this

Can I just use zdb -R pool 0:$blkid to dump the contents of this block? (There's only one vdev) I.e., is the "offset" in zdb -R the same as the zb_blkid of zbookmark_phys?

No, zb_blkid is the logical block ID, which is unrelated to the location on disk. You need the DVA's offset, which is also included in the zpool events -v output.
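
(For illustration, a minimal sketch of that workflow; the pool name, vdev index, offset, and size below are placeholders, and the exact field carrying the DVA offset in the event output may vary, so treat this as a pattern rather than a recipe:)

    # find the DVA offset/size reported alongside the ereport
    zpool events -v | less

    # zdb -R takes pool vdev:offset:size (see zdb(8) for the flag syntax)
    zdb -R tank 0:12345000:20000       # dump the block at that DVA
    zdb -R tank 0:12345000:20000:d     # same, but ask zdb to decompress it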

ahrens avatar Mar 15 '21 20:03 ahrens

For the record, I'm still seeing the same behavior that I reported in the issue @aerusso linked above. And, for the first time since this started (right after updating to 2.0), I just saw a kernel panic that left my zfs filesystems unresponsive. Here's what I found in dmesg at that time:

[380231.914245] #PF: supervisor write access in kernel mode
[380231.914310] #PF: error_code(0x0002) - not-present page
[380231.914373] PGD 0 P4D 0 
[380231.914414] Oops: 0002 [#1] SMP PTI
[380231.914462] CPU: 3 PID: 1741442 Comm: zpool Tainted: P           OE     5.10.23-1-lts #1
[380231.914569] Hardware name: Supermicro Super Server/X10DRL-i, BIOS 2.0 12/18/2015
[380231.914664] RIP: 0010:sa_setup+0xc7/0x5f0 [zfs]
[380231.914687] Code: 7b 01 00 4d 89 8f 10 05 00 00 48 85 ff 0f 84 b4 00 00 00 4c 89 0c 24 e8 27 c5 55 c6 49 8b 87 20 05 00 00 4c 8b 0c 24 4c 89 e7 <4c> 89 48 28 49 c7 87 10 05 00 00 00 00 00 00 e8 c5 eb 55 c6 48 89
[380231.914762] RSP: 0018:ffffac37c9963bf8 EFLAGS: 00010296
[380231.914786] RAX: 0000000000000000 RBX: ffff8b12fb31d4e8 RCX: ffff8b08050e8001
[380231.914818] RDX: ffff8b05a847f910 RSI: 0000000000000003 RDI: ffff8b12fb31d508
[380231.914848] RBP: ffffac37c9963df8 R08: 0000000000000000 R09: ffff8b08050e8000
[380231.914879] R10: 0000000000000e80 R11: 0000000000000000 R12: ffff8b12fb31d508
[380231.914910] R13: 0000000000000002 R14: ffffac37c9963c38 R15: ffff8b12fb31d000
[380231.914942] FS:  00007fcadcf9a7c0(0000) GS:ffff8b10ffac0000(0000) knlGS:0000000000000000
[380231.914976] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[380231.915002] CR2: 0000000000000028 CR3: 00000002cb2b8003 CR4: 00000000003706e0
[380231.915032] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[380231.915063] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[380231.915094] Call Trace:
[380231.915161]  zfs_sa_setup.constprop.0+0x6a/0x90 [zfs]
[380231.915233]  zfs_obj_to_path+0x50/0xe0 [zfs]
[380231.915291]  ? dmu_objset_hold_flags+0x95/0xe0 [zfs]
[380231.915361]  zfs_ioc_obj_to_path+0x86/0xf0 [zfs]
[380231.915387]  ? strlcpy+0x2d/0x40
[380231.915452]  zfsdev_ioctl_common+0x71c/0x880 [zfs]
[380231.915525]  zfsdev_ioctl+0x53/0xe0 [zfs]
[380231.915550]  __x64_sys_ioctl+0x83/0xb0
[380231.915567]  do_syscall_64+0x33/0x40
[380231.915581]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[380231.915598] RIP: 0033:0x7fcadd568e6b
[380231.915612] Code: ff ff ff 85 c0 79 8b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d d5 af 0c 00 f7 d8 64 89 01 48
[380231.915660] RSP: 002b:00007ffcdb0e43a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[380231.915682] RAX: ffffffffffffffda RBX: 000055e10285a560 RCX: 00007fcadd568e6b
[380231.915702] RDX: 00007ffcdb0e43e0 RSI: 0000000000005a25 RDI: 0000000000000003
[380231.915723] RBP: 00007ffcdb0e7ad0 R08: 0000000000000000 R09: 0000000000000000
[380231.915742] R10: 00007fcadd5e6ac0 R11: 0000000000000246 R12: 00007ffcdb0e7990
[380231.915762] R13: 00007ffcdb0e43e0 R14: 000055e102873440 R15: 0000000000002000
[380231.915783] Modules linked in: rpcsec_gss_krb5 xt_nat veth xt_MASQUERADE br_netfilter bridge stp llc target_core_user uio target_core_pscsi target_core_file target_core_iblock iscsi_target_mod target_core_mod xt_recent ipt_REJECT nf_reject_ipv4 xt_multiport xt_comment xt_conntrack xt_hashlimit xt_addrtype xt_mark iptable_mangle xt_CT xt_tcpudp iptable_raw nfnetlink_log xt_NFLOG nf_log_ipv4 nf_log_common xt_LOG nf_nat_tftp nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda ts_kmp nf_conntrack_amanda nf_conntrack_sane nf_conntrack_tftp nf_conntrack_sip nf_conntrack_pptp nf_conntrack_netlink nfnetlink nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c vboxnetflt(OE) vboxnetadp(OE) iptable_filter vboxdrv(OE) joydev mousedev usbhid intel_rapl_msr ipmi_ssif intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp
[380231.915824]  coretemp iTCO_wdt kvm_intel intel_pmc_bxt iTCO_vendor_support kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper rapl intel_cstate intel_uncore pcspkr ixgbe igb mdio_devres i2c_i801 libphy mei_me i2c_smbus ast mdio mei i2c_algo_bit ioatdma lpc_ich dca wmi mac_hid acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) vboxvideo drm_vram_helper drm_ttm_helper ttm drm_kms_helper cec syscopyarea sysfillrect sysimgblt fb_sys_fops vboxsf vboxguest crypto_user fuse drm agpgart nfsd auth_rpcgss nfs_acl lockd grace sunrpc nfs_ssc bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 mpt3sas crc32c_intel raid_class scsi_transport_sas xhci_pci xhci_pci_renesas [last unloaded: vboxdrv]
[380231.924269] CR2: 0000000000000028
[380231.925010] ---[ end trace 67d499cf5193f033 ]---
[380231.984744] RIP: 0010:sa_setup+0xc7/0x5f0 [zfs]
[380231.986002] Code: 7b 01 00 4d 89 8f 10 05 00 00 48 85 ff 0f 84 b4 00 00 00 4c 89 0c 24 e8 27 c5 55 c6 49 8b 87 20 05 00 00 4c 8b 0c 24 4c 89 e7 <4c> 89 48 28 49 c7 87 10 05 00 00 00 00 00 00 e8 c5 eb 55 c6 48 89
[380231.988429] RSP: 0018:ffffac37c9963bf8 EFLAGS: 00010296
[380231.989326] RAX: 0000000000000000 RBX: ffff8b12fb31d4e8 RCX: ffff8b08050e8001
[380231.990116] RDX: ffff8b05a847f910 RSI: 0000000000000003 RDI: ffff8b12fb31d508
[380231.990890] RBP: ffffac37c9963df8 R08: 0000000000000000 R09: ffff8b08050e8000
[380231.991657] R10: 0000000000000e80 R11: 0000000000000000 R12: ffff8b12fb31d508
[380231.992419] R13: 0000000000000002 R14: ffffac37c9963c38 R15: ffff8b12fb31d000
[380231.993182] FS:  00007fcadcf9a7c0(0000) GS:ffff8b10ffac0000(0000) knlGS:0000000000000000
[380231.993959] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[380231.994731] CR2: 0000000000000028 CR3: 00000002cb2b8003 CR4: 00000000003706e0
[380231.995509] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[380231.996285] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Unfortunately I'm not running a debug build here, so that stack is only of so much value, but I wanted to share it nonetheless in case it provides any insight into this issue.

jstenback avatar Mar 19 '21 14:03 jstenback

For the record, I just got another kernel crash dump with the same exact behavior and stack trace in dmesg as reported in my previous comment. The dmesg this time (maybe last time too, though that didn't make it into my earlier report) states that it's a kernel NULL pointer dereference.

jstenback avatar Mar 26 '21 05:03 jstenback

@jstenback I don't see how the problem you're describing is related to this issue. @aerusso is experiencing unexpected checksum errors, and you have a null pointer dereference. Unless I'm missing something, please file a separate issue report for this.

ahrens avatar Mar 26 '21 05:03 ahrens

@ahrens I'm experiencing the exact same symptoms that @aerusso is experiencing, in addition to the two null pointer dereferences I've seen over the past week or so. I figured there's a chance they're related, but that may of course not be the case at all.

jstenback avatar Mar 26 '21 06:03 jstenback

For reference, I experienced the corruption in this report after minutes of running 2.0.3. The total time I used 2.0.3 was probably less than 2 hours. I'm guessing that @jstenback has been running the kernel for hundreds of hours. It might be that I just had not yet experienced that symptom. (Of course, it's also possible it's an unrelated bug).

aerusso avatar Mar 26 '21 13:03 aerusso

That is correct, my uptime during both of the crashes I mentioned was on the order of a hundred hours. And I typically start seeing the corruption after about the same amount of uptime.

jstenback avatar Mar 26 '21 14:03 jstenback

@aerusso Can you also mention what the last 'good' version of ZFS was, i.e. the one where you didn't experience the issue? That could help narrow down the search.

IvanVolosyuk avatar Mar 26 '21 22:03 IvanVolosyuk

@IvanVolosyuk Unfortunately, my last known good configuration is ZFS 0.8.6 and Linux 5.9.15 (and it's stable as a rock back here). I was also unsuccessful in reproducing the bug in a VM (using a byte-for-byte copy of the whole 1 TB nvme).

My current plan (once I can find a free weekend) is to try to bisect on the actual workstation exhibiting the bug. To complicate things, I'll have to do the bisection back here with Linux 5.9.15, since support for 5.10 wasn't added until very late in the release cycle.

aerusso avatar Mar 26 '21 23:03 aerusso

As I've noted in #12014, I've been running 2.0.2 (with Ubuntu 21.04, Linux 5.11.0) since 30 April and I haven't experienced any issues yet.

On my server with Debian Buster, Linux 5.10 and ZFS 2.0.3 (from backports), I've experienced the issue on 4 datasets, ~~but not at the same time~~ EDIT: two of them at the same time:

            zvm/secure/locker@autosnap_2021-05-05_15:00:30_frequently:<0x0>
            zvm/secure/plappermaul@autosnap_2021-04-30_20:00:30_hourly:<0x0>
            zvm/secure/sunshine-db@autosnap_2021-05-08_00:45:30_frequently:<0x0>
            zvm/secure/dovecote-root@autosnap_2021-05-08_00:45:30_frequently:<0x0>

What I've also noted in the other issue is that after reverting back to 0.8.4, everything seemed OK. I've also managed to destroy the affected snapshots, and multiple scrubs didn't detect any issues.
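
(A sketch of that cleanup procedure as described in this thread; the pool, dataset, and snapshot names are placeholders, and zpool wait requires ZFS 2.0 or later:)

    zfs destroy pool/dataset@affected-snapshot      # the snapshot listed by zpool status -v
    zpool scrub pool && zpool wait -t scrub pool    # first scrub
    zpool scrub pool && zpool wait -t scrub pool    # second scrub; the error list clears after two clean passes
    zpool status -v pool                            # should no longer report permanent errors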

glueckself avatar May 09 '21 11:05 glueckself

I added 3f81aba76 on top of Debian's 2.0.3-8, and am tentatively reporting that I cannot reproduce this bug. I've been running for about 45 minutes now, without the permanent error (I used to experience this bug immediately upon running my sanoid workload, which at this point has run three times).

I would suggest that anyone already running 2.x consider applying that patch.

aerusso avatar Jun 05 '21 15:06 aerusso

Unfortunately my optimism was premature. After another two and a half hours, I did indeed experience another corrupted snapshot.

aerusso avatar Jun 05 '21 18:06 aerusso

After about 3.5 hours of uptime under Linux 5.9.15-1 (making sure this can be reproduced on a kernel supporting the known-good 0.8.6) with ZFS 3c1a2a945 (candidate 2.0.5 with another suspect patch reverted):

zpool events -v reveals

class = "ereport.fs.zfs.authentication"
ena = 0xb91dbe5624302801
detector = (embedded nvlist)
        version = 0x0
        scheme = "zfs"
        pool = 0xe9128a59c39360a
(end detector)
pool = "REDACTED"
pool_guid = 0xe9128a59c39360a
pool_state = 0x0
pool_context = 0x0
pool_failmode = "wait"
zio_objset = 0x83c4
zio_object = 0x0
zio_level = 0x0
zio_blkid = 0x0
time = 0x60c52c8d 0x7a6bfee 
eid = 0x116e

I failed to capture this information in my previous reports. I can reproduce this by trying to send the offending snapshot.

This dataset has encryption set to aes-256-gcm.

Also, am I correct that there is some kind of MAC that is calculated before the on-disk checksum? My pool shows no READ/WRITE/CKSUM errors---does that mean that the data and/or the MAC was wrong before being written? Should I try changing any encryption settings?

# cat /sys/module/icp/parameters/icp_gcm_impl
cycle [fastest] avx generic pclmulqdq

aerusso avatar Jun 12 '21 22:06 aerusso

More fun: I can zfs send the "corrupt" snapshot after rebooting the machine (both in 0.8.6 and the 2.0.5-ish that caused the problem; hopefully it doesn't matter, but I tested in that order). Is it possible that some cache is somehow being corrupted?

aerusso avatar Jun 12 '21 23:06 aerusso

Also, am I correct that there is some kind of MAC that is calculated before the on-disk checksum?

Yes, both AES-CCM and AES-GCM are authenticated encryption algorithms, which protect against tampering with the ciphertext. Encrypting a block creates a MAC, and the same MAC is computed while decrypting; if they don't match, decryption fails, generating an I/O error. Half of the block pointer checksum of an encrypted block is this MAC, and the other half is the checksum of the encrypted block.

My pool shows no READ/WRITE/CKSUM errors---does that mean that the data and/or the MAC was wrong before being written?

There were some problems with metadata (<0x0>) MACs related to ZFS user/group/project used accounting on datasets created with certain master ZFS versions, but I'd need to look up the details.

Should I try changing any encryption settings?

You could try to change icp_gcm_impl to generic to make sure the avx implementation didn't break something. Since it went into 0.8.4 I don't think this is likely though.
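
(For reference, a sketch of trying that suggestion at runtime; the parameter path is the one shown earlier in this thread, and the modprobe.d file name is an arbitrary choice:)

    cat /sys/module/icp/parameters/icp_gcm_impl              # e.g. "cycle [fastest] avx generic pclmulqdq"
    echo generic > /sys/module/icp/parameters/icp_gcm_impl   # switch to the generic implementation
    cat /sys/module/icp/parameters/icp_gcm_impl              # verify that "generic" is now selected

    # to make the choice persistent across reboots:
    echo 'options icp icp_gcm_impl=generic' > /etc/modprobe.d/icp.conf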

AttilaFueloep avatar Jun 12 '21 23:06 AttilaFueloep

@aerusso, to answer your questions from r/zfs (hope this is the right issue):

Are your affected datasets encrypted?

Yes. All of them are direct children of an encrypted parent and inherit its encryption.

Does this error coincide with a failed zfs send? Is there also a bunch of snapshots being taken at the same time?

Sanoid is taking snapshots of all datasets every 5 minutes. I can't find any log entry about sanoid failing to send it; however, manually running zfs send zvm/secure/sunshine-db@autosnap_2021-08-02_07:00:02_hourly > /dev/null fails with cannot open 'pool/secure/sunshine-db@autosnap_2021-08-02_07:00:02_hourly': I/O error. Also, trying syncoid with the entire dataset results in cannot iterate filesystems: I/O error.

(Here's a real hopeful question) Do you have a specific workload that consistently reproduces the bug? (I have to wait ~hours to reproduce it, which will make bisecting this tragically hard).

Not really. I changed the syncoid cronjob to every 5 minutes yesterday and then it happened. Have you tried reproducing it with really frequent snapshots and sending, something like while true; do uuid > /mnt/pool/dataset/testfile; zfs snapshot pool/dataset@$(uuid); done & while true; do sanoid ...; done? (The testfile write is so that the snapshots have some content.)
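
(Expanding that suggestion into a self-contained sketch; the pool/dataset names, mountpoint, and snapshot naming are hypothetical, and a plain incremental send/receive loop stands in for the sanoid/syncoid pieces:)

    # seed the receive side once
    zfs snapshot tank/src@base
    zfs send tank/src@base | zfs receive scratch/dst

    prev=base
    while true; do
        dd if=/dev/urandom of=/mnt/tank/src/testfile bs=1M count=1 2>/dev/null   # give each snapshot some content
        cur="stress-$(date +%s)"
        zfs snapshot "tank/src@$cur"
        zfs send -I "@$prev" "tank/src@$cur" | zfs receive -F scratch/dst
        prev="$cur"
        sleep 1
    done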

Can you grab a zpool events -v output relatively quickly after any such permanent errors occur (you can also enable the all-debug.sh zedlet)? Any such output might be very useful.

If possible, please try rebooting, without deleting the snapshot, and accessing it (i.e., zfs send). Does the snapshot send without issue? If so, try scrubbing the pool twice (the pool doesn't forget the errors until there are two clean scrubs). This worked for me, by the way. This makes me suspect that there may be no on-disk corruption at all.

~~To answer the last two: no, after rebooting I get invalid argument when I try to zfs send one of the datasets.~~ Too bad I read too late that this seems to be an issue with zfs send writing to /dev/null (see #11445), so the send probably would have worked. Scrubbing it twice resolved the corruptions, thanks!

~~I'm moving everything away from that server right now; afterwards I'll reboot, test that, try to clean it up, then start a zpool events -vf > somelogfile and see if my really-frequent-snapshots/sends method can reproduce it.~~ ~~I'm setting up a VM to play around, I don't want to break my host's pool...~~ ~~I'll let you know when I have something more.~~

Ok, so I've had the following running for the last few hours:

    while true; do for i in $(seq 1 10); do zfs snapshot pool1/secure/test${i}@$(uuid); sleep 1; done; done
    while true; do for i in $(seq 1 10); do dd if=/dev/urandom of=/test/${i}/blafile bs=1M count=5 & done; sleep 1; done
    while true; do syncoid -r pool1/secure [email protected]:pool2/secure/t1-test-2; sleep 1; done

without any corruption so far. This was running inside a VM, which used a SATA SSD instead of the NVMe SSDs my normal VMs use. I can waste that SSD, so if you have any suggestions to stress test snapshots & sending, let me know.

glueckself avatar Aug 02 '21 15:08 glueckself

I have one more occurrence, however this time no sends were involved. When I woke up today, my backup server was offline and all I/O seemed to hang (i.e., after I typed root + password it would hang, and my services were offline). After a hard reboot, every active dataset had snapshot corruptions, which went away after scrubbing twice.

Also, in the dmesg I've got the following warning:

Aug 08 00:29:01 host.example.com kernel: VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
Aug 08 00:29:01 host.example.com kernel: PANIC at arc.c:3790:arc_buf_destroy()
Aug 08 00:29:01 host.example.com kernel: Showing stack for process 482
Aug 08 00:29:01 host.example.com kernel: CPU: 0 PID: 482 Comm: z_rd_int Tainted: P          IO      5.11.0-25-generic #27-Ubuntu
Aug 08 00:29:01 host.example.com kernel: Hardware name: <MANUFACTURER> <PRODUCT>, BIOS <VERSION> <DATE>
Aug 08 00:29:01 host.example.com kernel: Call Trace:
Aug 08 00:29:01 host.example.com kernel:  show_stack+0x52/0x58
Aug 08 00:29:01 host.example.com kernel:  dump_stack+0x70/0x8b
Aug 08 00:29:01 host.example.com kernel:  spl_dumpstack+0x29/0x2b [spl]
Aug 08 00:29:01 host.example.com kernel:  spl_panic+0xd4/0xfc [spl]
Aug 08 00:29:01 host.example.com kernel:  ? do_raw_spin_unlock.constprop.0+0x9/0x10 [zfs]
Aug 08 00:29:01 host.example.com kernel:  ? __raw_spin_unlock.constprop.0+0x9/0x10 [zfs]
Aug 08 00:29:01 host.example.com kernel:  ? zfs_zevent_post+0x183/0x1c0 [zfs]
Aug 08 00:29:01 host.example.com kernel:  ? cityhash4+0x8d/0xa0 [zcommon]
Aug 08 00:29:01 host.example.com kernel:  ? abd_verify+0x15/0x70 [zfs]
Aug 08 00:29:01 host.example.com kernel:  ? abd_to_buf+0x12/0x20 [zfs]
Aug 08 00:29:01 host.example.com kernel:  arc_buf_destroy+0xe8/0xf0 [zfs]
Aug 08 00:29:01 host.example.com kernel:  arc_read_done+0x213/0x4a0 [zfs]
Aug 08 00:29:01 host.example.com kernel:  zio_done+0x39d/0xdc0 [zfs]
Aug 08 00:29:01 host.example.com kernel:  zio_execute+0x92/0xe0 [zfs]
Aug 08 00:29:01 host.example.com kernel:  taskq_thread+0x236/0x420 [spl]
Aug 08 00:29:01 host.example.com kernel:  ? wake_up_q+0xa0/0xa0
Aug 08 00:29:01 host.example.com kernel:  ? zio_execute_stack_check.constprop.0+0x10/0x10 [zfs]
Aug 08 00:29:01 host.example.com kernel:  kthread+0x12f/0x150
Aug 08 00:29:01 host.example.com kernel:  ? param_set_taskq_kick+0xf0/0xf0 [spl]
Aug 08 00:29:01 host.example.com kernel:  ? __kthread_bind_mask+0x70/0x70
Aug 08 00:29:01 host.example.com kernel:  ret_from_fork+0x22/0x30

One minute afterwards the snapshotting starts, and all the ZFS-related tasks start hanging:

Aug 08 00:31:54 host.example.com kernel: INFO: task z_rd_int:482 blocked for more than 120 seconds.
Aug 08 00:31:54 host.example.com kernel:       Tainted: P          IO      5.11.0-25-generic #27-Ubuntu
Aug 08 00:31:54 host.example.com kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message

(with the call stack below, if you think it's relevant let me know and I can post it)

Affected tasks were also dp_sync_taskq and txg_sync, which explains why the I/O was hanging (if I may guess, z_rd_int is the read interrupt handler and txg_sync writes the transaction group to the disks).

I don't have the pool events, sorry for that.

EDIT: two more things to note. The system's average load is about 0.5 (an old laptop running ~10 VMs) and the ARC probably sees high memory pressure. I had 8 GB of huge pages reserved for the VMs and a 4 GB zfs_arc_max, with 12 GB RAM total, so the ARC has to fight with the host system (which is not much: the Linux kernel, libvirt, and an SSH server; I'd guess 100-200 MB). I've now reduced the VM huge pages to 7 GB, which should reduce the memory pressure.

glueckself avatar Aug 08 '21 08:08 glueckself

I have never had one of these kernel panics, so it may be better to put this in #12014 (which I see you've already posted in -- I'm also subscribed to that bug). The output of zpool status -v and zpool events immediately after a failed zfs send should give us a lot of diagnostic information.

It's reassuring that merely rebooting and scrubbing makes the invalid reads go away, but you may want to set up the ZED, enable all_debug.sh, and set the output directory to something that is permanent. Then, in the event of a crash, you can sift through the old zpool events.
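
(A sketch of that setup; the zedlet and log paths below are Debian-style assumptions and differ between distributions:)

    # enable the all-debug.sh zedlet by linking it into the active zedlet directory
    ln -s /usr/lib/zfs-linux/zed.d/all-debug.sh /etc/zfs/zed.d/
    # point its output somewhere permanent instead of the default /tmp location
    echo 'ZED_DEBUG_LOG="/var/log/zed.debug.log"' >> /etc/zfs/zed.d/zed.rc
    systemctl restart zfs-zed

    # and immediately after a failed zfs send, capture state before it is cleared:
    zpool status -v > /root/zpool-status-$(date +%F).txt
    zpool events -v > /root/zpool-events-$(date +%F).txt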

aerusso avatar Aug 08 '21 13:08 aerusso

I can still reproduce this while running #12346.

aerusso avatar Aug 26 '21 14:08 aerusso

I'm going to try to bisect this bug over the course of the next ~months, hopefully. @behlendorf, are there any particularly dangerous commits/bugs I should be aware of lurking between e9353bc2e and 78fac8d92? Any suggestions on doing this monster bisect (each failed test is going to take about ~3 hours, each successful test probably needs ~6 hours to make sure I'm not just "getting lucky" and not hitting the bug)?
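
(Roughly what such a bisect could look like; treating e9353bc2e as the known-good endpoint and 78fac8d92 as the known-bad one is an assumption, as are the build steps; swap the endpoints if that is backwards:)

    git clone https://github.com/openzfs/zfs.git && cd zfs
    git bisect start 78fac8d92 e9353bc2e        # bad commit first, then good

    # at each step: build, install, reboot into it, and run the syncoid workload
    sh autogen.sh && ./configure && make -j"$(nproc)" && sudo make install
    # ...run the workload for ~3-6 hours, watching zpool status -v for <0x0> errors...
    git bisect bad        # if the authentication errors showed up
    git bisect good       # if the run stayed clean long enough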

aerusso avatar Aug 27 '21 11:08 aerusso

Hey there, we're seeing this as well - not sure if I should put it here or in #12014.

System specs

  • TrueNAS 12.0-U5 (FreeBSD 12.2 + OpenZFS 2.0.5)
  • AMD Epyc 7F52 in an AsRockRack ROMED8-2T motherboard
  • 512GB RAM ECC
  • Chelsio T580 LP-CR with 2x 40G
  • 4x LSI 9305-16i controllers
  • 60x SAS-12G SSD; all drives enterprise 8TB and 16TB (56 in use)
  • SuperMicro 4U, 72-bay chassis with redundant PS & 3x BPN-SAS3-216A-N4 backplanes
    • No NVMe devices populated

History

  • Pool was rebuilt in late July/early August from backup as 9 6-disk RAIDZ2 vdevs with encryption, AES-128-GCM, with AMD hardware acceleration.
  • Data was replicated from the backup server using zfs send/recv.
  • Server ran for ~3 weeks without issue before we moved the server into production (acting primarily as NFS server; no VM or DB loads.)
  • Once it became active, we turned on our backup setup, which takes regular snapshots every 10 min using pyznap; these are then zfs send/recv'd to the backup server.
  • Within 24 hours, we started seeing the corruption described in this ticket, causing whole machine crashes.
    • No kernel panics were observed, but that could be an artefact of the BMC not displaying the crashes.
    • The BMC shows no hardware failures, no memory errors, etc.
    • The disks report no errors via SMART
    • The zpool status command shows no checksum errors
    • We did see these errors on zpool status -v:
errors: Permanent errors have been detected in the following files:
        <0x21c49>:<0x0>
        Speedy/[client]/Projects/foo@pyznap_2021-08-25_15:00:00_hourly:<0x1>
        Speedy/[client]/Projects/foo@pyznap_2021-08-25_15:00:00_hourly:<0x20>
        <0x50a5>:<0x0>
        <0xa7d5>:<0x0>
        <0xa4de>:<0x0>
        <0x191e5>:<0x0>
        <0x191e5>:<0x3>
  • A scrub, interrupted by reboots, eventually reported this without resolving the errors: scan: scrub repaired 0B in 08:15:27 with 0 errors on Tue Aug 24 08:31:40 2021
  • We weren't able to confirm at that time that this was an ereport.fs.zfs.authentication failure because we were too busy with DR protocols.
  • Once we removed the load from the system:
    • We mounted the pool under Ubuntu 21.04 and saw the same output
    • Back under TrueNAS, zpool events -v confirmed:
Aug 31 2021 20:14:44.609825904 ereport.fs.zfs.authentication
        class = "ereport.fs.zfs.authentication"
        ena = 0xca1cc9e0cba06c01
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0x6909b6729e67dcf9
        (end detector)
        pool = "Speedy"
        pool_guid = 0x6909b6729e67dcf9
        pool_state = 0x0
        pool_context = 0x0
        pool_failmode = "continue"
        zio_objset = 0x65e7
        zio_object = 0x0
        zio_level = 0xffffffffffffffff
        zio_blkid = 0x0
        time = 0x612ec5f4 0x24593470
        eid = 0x136
  • After reading this ticket, we were then able to successfully complete a zpool scrub without a crash. The scrub / reboot / scrub process successfully eliminated the errors.
    • This didn't work earlier because crashes occurred so often, we could not complete a scrub process without a reboot in the middle.
  • As a result of reading this ticket we decided to rebuild the pool without encryption. Load has not yet been restored to the system in question.

wohali avatar Sep 01 '21 21:09 wohali

@wohali I'm assuming I just don't leave my machine running long enough to experience the crash. I have some questions:

  1. Can you describe the workload more precisely? I.e., approximately what kind of read/write/file creates are going on? Was the access primarily/exclusively via NFS? Can you describe what the clients were doing?
  2. When not "in production" was the server primarily just idle?
  3. Do you have any clean reproducer? (I know this is a stretch, but if you get me a reproducer I can do in a VM, I will fix this)

I think it's very curious that this only seems to happen for people with flash memory. I wonder if it's some race in the encryption code that just doesn't really happen on slower media? Or, is it possible that this is just being caught because of the authentication (encryption), and is otherwise silently passing through for unencrypted datasets?

aerusso avatar Sep 02 '21 06:09 aerusso

  1. Can you describe the workload more precisely? I.e., approximately what kind of read/write/file creates are going on? Was the access primarily/exclusively via NFS? Can you describe what the clients were doing?

Yes, exclusively an NFS server. It's /home directories, so access is varied. Under load it sees a constant 40Gbps of traffic.

2. When not "in production" was the server primarily just idle?

Yes

  3. Do you have any clean reproducer? (I know this is a stretch, but if you get me a reproducer I can do in a VM, I will fix this)

No, sorry.

wohali avatar Sep 02 '21 17:09 wohali

I also face this issue. My env is:

  • Debian Bullseye (Proxmox Backup Server)
  • Kernel 5.11.22-1-pve
  • ZFS 2.0.5
  • The pool is a single, simple vdev built on top of a hardware RAID10 of 10 SATA HDDs (I know it's not the recommended setup, but I have reasons to do it this way)
  • Encryption is aes-256-gcm

The server is running Proxmox Backup Server and BackupPC to handle backups. The load varies (it can be very busy during the night, a bit less during the day), but is mainly random read access. Sanoid is used to manage snapshots, and syncoid to replicate the data every hour to a remote location. The corrupted snapshots can be either those from sanoid or those from syncoid. I've never had any crash, though; the errors disappear after two scrubs (but as a scrub takes almost 3 days, most of the time a new corrupted snapshot appears during the previous scrub).

dani avatar Sep 03 '21 09:09 dani

Is it dangerous to create (and use) new encrypted pools/datasets using OpenZFS v2.1.1? Or is this bug related only to pools migrated from old (pre-2.x.x) versions?

isopix avatar Nov 13 '21 10:11 isopix

We're seeing similar issues: zfs snapshots on encrypted datasets run into permanent errors during syncoid runs that use zfs authentication via other users.

For us there is no kernel panic involved; it's simply the sends that break, and permanent errors appear only in snapshots, never on the actual file system itself. Destroying the snapshots in question allows normal sync operations to resume and is otherwise unproblematic (aside from the fact that we've just lost an hourly snapshot). This is on AlmaLinux with zfs 2.0.7-1.

The issue appears every few hours on different datasets, seemingly at random. It might also be caused by the send operations themselves; I'm not really sure how to debug this.

Blackclaws avatar Jan 20 '22 10:01 Blackclaws

For us there is no kernel panic involved; it's simply the sends that break, and permanent errors appear only in snapshots, never on the actual file system itself. Destroying the snapshots in question allows normal sync operations to resume and is otherwise unproblematic (aside from the fact that we've just lost an hourly snapshot). This is on AlmaLinux with zfs 2.0.7-1.

The issue appears every few hours on different datasets, seemingly at random. It might also be caused by the send operations themselves; I'm not really sure how to debug this.

I'm experiencing exactly the same symptoms (i.e., no kernel panics, just permanent errors detected in snapshots).

Upgrading from ZFS 2.0.X to 2.1.2 seems to have significantly reduced the frequency of the errors. This is on a Debian Bullseye (formerly Buster) system using znapzend to take automated snapshots of multiple encrypted datasets every 15-30 minutes. Deleting the snapshots and performing 2 scrubs clears the errors.

atj avatar Jan 20 '22 16:01 atj

Upgrading to zfs 2.1.2 and scrubbing has also cleared existing errors that would not clear before. When is zfs 2.1.2 going to be released as stable for RHEL-based distros? Right now it's only available as testing.

Blackclaws avatar Jan 27 '22 11:01 Blackclaws

I wonder if this issue is about to impact a lot more systems as admins upgrade over the next few months from Ubuntu LTS 20.04 (OpenZFS 0.8.3) to LTS 22.04 (OpenZFS v2.1.2, as of this writing). This issue is the reason I'm going to hold off on that move. I believe Debian 10 also shipped an unaffected version; I'm not sure what 11 comes with.

Maltz42 avatar Apr 15 '22 22:04 Maltz42