
ZFS Receive of encrypted incremental data stream causes a PANIC

Open cyberpower678 opened this issue 2 years ago • 36 comments

System information

Type Version/Name
Distribution Name Debian
Distribution Version Bullseye
Kernel Version 5.10.109+truenas
Architecture amd64
OpenZFS Version zfs-2.1.2-95_g1d2cdd23b zfs-kmod-2.1.2-95_g1d2cdd23b

Describe the problem you're observing

During an incremental receive, ZFS caused a panic and a system hangup.

Describe how to reproduce the problem

It happens randomly.

Include any warning/errors/backtraces from the system logs

[456678.240841] VERIFY3(0 == dmu_object_set_blocksize(rwa->os, drro->drr_object, drro->drr_blksz, drro->drr_indblkshift, tx)) failed (0 == 95)
[456678.243815] PANIC at dmu_recv.c:1776:receive_object()
[456678.245141] Showing stack for process 2936808
[456678.246532] CPU: 10 PID: 2936808 Comm: receive_writer Tainted: P           OE     5.10.109+truenas #1
[456678.247840] Hardware name: Supermicro X9QR7-TF+/X9QRi-F+/X9QR7-TF+/X9QRi-F+, BIOS 3.0b 05/20/2015
[456678.249138] Call Trace:
[456678.250421]  dump_stack+0x6b/0x83
[456678.251676]  spl_panic+0xd4/0xfc [spl]
[456678.253038]  ? arc_buf_access+0x14c/0x250 [zfs]
[456678.254276]  ? dnode_hold_impl+0x4e9/0xef0 [zfs]
[456678.255493]  ? dnode_set_blksz+0x13b/0x300 [zfs]
[456678.256677]  ? dnode_rele_and_unlock+0x5c/0xc0 [zfs]
[456678.257846]  receive_object+0xc2c/0xca0 [zfs]
[456678.258984]  ? dmu_object_next+0xd6/0x120 [zfs]
[456678.260098]  ? receive_writer_thread+0xbd/0xad0 [zfs]
[456678.261160]  ? kfree+0x40c/0x480
[456678.262202]  ? _cond_resched+0x16/0x40
[456678.263244]  receive_writer_thread+0x1cc/0xad0 [zfs]
[456678.264280]  ? thread_generic_wrapper+0x62/0x80 [spl]
[456678.265252]  ? kfree+0x40c/0x480
[456678.266242]  ? receive_process_write_record+0x190/0x190 [zfs]
[456678.267177]  ? thread_generic_wrapper+0x6f/0x80 [spl]
[456678.268092]  thread_generic_wrapper+0x6f/0x80 [spl]
[456678.268988]  ? __thread_exit+0x20/0x20 [spl]
[456678.269864]  kthread+0x11b/0x140
[456678.270706]  ? __kthread_bind_mask+0x60/0x60
[456678.271538]  ret_from_fork+0x22/0x30

cyberpower678 avatar May 10 '22 19:05 cyberpower678

I would suggest trying a 2.1.4-based build, as there were fixes for encrypted send/recv and blocksize changing in the interim.

rincebrain avatar May 10 '22 23:05 rincebrain

I am experiencing the same issue. I copied the bug issue template to organize my response better, but since it is a duplicate I don't want to open a new issue.

System information

Type Version/Name
Distribution Name Debian
Distribution Version Bullseye
Kernel Version 5.10.0-14-amd64
Architecture amd64
OpenZFS Version zfs-2.1.4-1 zfs-kmod-2.1.4-1

Describe the problem you're observing

Same result - receiving system locks up in a ZFS panic.

Describe how to reproduce the problem

Happens reliably with both of my TrueNAS SCALE datasets. I vaguely remember it happening after I changed permissions of some folders, but nothing more concrete.

Source dataset is a dataset created by TrueNAS with no special manual changes/settings; it is fully encrypted. Destination dataset was created using:

openssl rand -out /key 32
zpool create -O encryption=aes-256-gcm -O keyformat=raw -O keylocation=file:///key pool /dev/sdb3
zfs create pool/test

Dataset was then sent to pool/test, resulting in a panic and lockup of the system.
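
For concreteness, a minimal sketch of the kind of raw send/receive involved (dataset and snapshot names below are placeholders, not the exact ones used):

# full raw (still-encrypted) stream into the freshly created pool
zfs send -w tank/source@snap1 | zfs receive pool/test/received
# incremental raw stream on top of it -- the receiving side is where the panic hits
zfs send -w -i tank/source@snap1 tank/source@snap2 | zfs receive pool/test/received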

If required, I can try to provide the dataset that causes this (it is about 135GB in size and in active use), though I am currently not sure how, since zfs sending it is not an option... Encryption keys will of course not be provided.

Include any warning/errors/backtraces from the system logs

From my production server running ZFS 2.1.2 where this was found (Sorry for the image, not sure why I took a screenshot instead of copying the text): [screenshot of the panic attached]

Reproduced in a virtual machine running ZFS 2.1.4:

[ 1712.885113] VERIFY3(0 == dmu_object_set_blocksize(rwa->os, drro->drr_object, drro->drr_blksz, drro->drr_indblkshift, tx)) failed (0 == 95)
[ 1712.885147] PANIC at dmu_recv.c:1784:receive_object()
[ 1712.885159] Showing stack for process 359011
[ 1712.885161] CPU: 2 PID: 359011 Comm: receive_writer Tainted: P           OE     5.10.0-14-amd64 #1 Debian 5.10.113-1
[ 1712.885161] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 1712.885162] Call Trace:
[ 1712.885168]  dump_stack+0x6b/0x83
[ 1712.885173]  spl_panic+0xd4/0xfc [spl]
[ 1712.885204]  ? arc_buf_access+0x14c/0x250 [zfs]
[ 1712.885231]  ? dnode_hold_impl+0x4e9/0xef0 [zfs]
[ 1712.885258]  ? dnode_set_blksz+0x13b/0x300 [zfs]
[ 1712.885283]  ? dnode_rele_and_unlock+0x64/0xc0 [zfs]
[ 1712.885316]  receive_object+0xc2c/0xca0 [zfs]
[ 1712.885342]  ? dnode_rele_and_unlock+0x64/0xc0 [zfs]
[ 1712.885368]  ? flush_write_batch+0x1de/0x560 [zfs]
[ 1712.885393]  receive_writer_thread+0x1cc/0xad0 [zfs]
[ 1712.885395]  ? kfree+0xba/0x480
[ 1712.885420]  ? receive_process_write_record+0x190/0x190 [zfs]
[ 1712.885423]  ? thread_generic_wrapper+0x6f/0x80 [spl]
[ 1712.885425]  thread_generic_wrapper+0x6f/0x80 [spl]
[ 1712.885428]  ? __thread_exit+0x20/0x20 [spl]
[ 1712.885430]  kthread+0x11b/0x140
[ 1712.885431]  ? __kthread_bind_mask+0x60/0x60
[ 1712.885432]  ret_from_fork+0x22/0x30

Additionally, attempting to delete the partially received dataset on the receiving side results in another ZFS panic:

[  173.583242] PANIC: zfs: adding existent segment to range tree (offset=3c2937b800 size=400)
[  173.583267] Showing stack for process 1342
[  173.583269] CPU: 3 PID: 1342 Comm: txg_sync Tainted: P           OE     5.10.0-14-amd64 #1 Debian 5.10.113-1
[  173.583269] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  173.583270] Call Trace:
[  173.583275]  dump_stack+0x6b/0x83
[  173.583281]  vcmn_err.cold+0x58/0x80 [spl]
[  173.583284]  ? sysvec_apic_timer_interrupt+0x36/0x80
[  173.583286]  ? irq_exit_rcu+0x3e/0xc0
[  173.583287]  ? sysvec_apic_timer_interrupt+0x36/0x80
[  173.583288]  ? asm_sysvec_apic_timer_interrupt+0x12/0x20
[  173.583317]  ? zfs_btree_insert_into_leaf+0x233/0x2a0 [zfs]
[  173.583352]  zfs_panic_recover+0x6d/0x90 [zfs]
[  173.583384]  range_tree_add_impl+0x305/0xe40 [zfs]
[  173.583414]  metaslab_free_concrete+0x11d/0x250 [zfs]
[  173.583441]  metaslab_free_impl+0xa9/0xe0 [zfs]
[  173.583467]  metaslab_free+0x168/0x190 [zfs]
[  173.583501]  zio_free_sync+0xda/0xf0 [zfs]
[  173.583533]  zio_free+0xb3/0xf0 [zfs]
[  173.583563]  delete_blkptr_cb+0x22/0x110 [zfs]
[  173.583589]  bplist_iterate+0xc9/0x130 [zfs]
[  173.583617]  ? sublist_delete_sync+0x50/0x50 [zfs]
[  173.583644]  sublist_delete_sync+0x2e/0x50 [zfs]
[  173.583673]  dsl_sync_task_sync+0xa6/0xf0 [zfs]
[  173.583701]  dsl_pool_sync+0x40d/0x520 [zfs]
[  173.583730]  spa_sync+0x542/0xfa0 [zfs]
[  173.583731]  ? mutex_lock+0xe/0x30
[  173.583762]  ? spa_txg_history_init_io+0x101/0x110 [zfs]
[  173.583792]  txg_sync_thread+0x2e0/0x4a0 [zfs]
[  173.583821]  ? txg_fini+0x250/0x250 [zfs]
[  173.583824]  thread_generic_wrapper+0x6f/0x80 [spl]
[  173.583827]  ? __thread_exit+0x20/0x20 [spl]
[  173.583828]  kthread+0x11b/0x140
[  173.583829]  ? __kthread_bind_mask+0x60/0x60
[  173.583831]  ret_from_fork+0x22/0x30

Nemezor avatar May 15 '22 21:05 Nemezor

I experience the same issue on ZFS 2.1.4 on Debian Bullseye amd64 and on another TrueNAS system. This practically renders my ZFS setup useless because I cannot perform any snapshot based replication and/or backup. Is it something dataset specific that could be fixed? Is there a workaround for this? It's been nearly three months since this has been reported. 😢

schreiberstein avatar Aug 04 '22 11:08 schreiberstein

I experience the same issue on ZFS 2.1.4 on Debian Bullseye amd64 and on another TrueNAS system. This practically renders my ZFS setup useless because I cannot perform any snapshot based replication and/or backup. Is it something dataset specific that could be fixed? Is there a workaround for this? It's been nearly three months since this has been reported. 😢

You can definitely work around it, especially if you know the date of the last snapshot that was successfully sent. The idea is to find all files and folders that were created/modified since the last snapshot was sent, back them up, roll the dataset back, and then put the backed-up files back on. Create a new incremental snapshot, and send it. It should work.
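
Roughly, as a sketch (every name and path below is a placeholder, deleted files are not handled, GNU find/rsync are assumed; adapt it to your own layout before running anything):

SRC=pool/dataset                      # source dataset
MNT=/mnt/pool/dataset                 # its mountpoint
LAST=last-sent                        # last snapshot that made it to the backup
# 1. copy out everything changed since that snapshot
CREATED=$(zfs get -Hp -o value creation $SRC@$LAST)
find "$MNT" -newermt "@$CREATED" -print > /tmp/changed.txt
rsync -a --files-from=/tmp/changed.txt / /scratch/changed/
# 2. roll source and destination back to the last good snapshot
zfs rollback -r $SRC@$LAST
ssh backup zfs rollback -r backuppool/dataset@$LAST
# 3. put the changed files back and take a fresh snapshot
rsync -a /scratch/changed$MNT/ "$MNT"/
zfs snapshot $SRC@resync
# 4. retry the incremental send
zfs send -w -i @$LAST $SRC@resync | ssh backup zfs receive backuppool/dataset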

cyberpower678 avatar Aug 04 '22 12:08 cyberpower678

You can definitely work around it, especially if you know the date of the last snapshot that was successfully sent. The idea is to find all files and folders that were created/modified since the last snapshot was sent, back them up, roll the dataset back, and then put the backed-up files back on. Create a new incremental snapshot, and send it. It should work.

Interesting approach! But sadly it wouldn't work for me. I sent a fresh snapshot to a brand new system, and after the transfer was completed, I wanted to send a newer (final) snapshot of all the data that was changed in the meantime to completely switch over to said new system. (Lots of small files; rsync is not viable, since indexing would take longer than the permissible weekend of downtime.) However, no matter what I do, all incremental sends end in a PANIC. I also tried different snapshots. The original dataset got transferred a long time ago from a different TrueNAS system. The only thing that got changed was switching the encryption from passphrase to keyfile on the previous system (the system which I try to send from now) some time ago. There are dozens of snapshots and I cannot attempt them all brute-force-style. My only hope would be to copy (send and receive) the dataset locally and pray that it may fix itself.

I wonder how this is possible. Userspace operations causing a literal kernel panic. I know everybody just focuses on the issues/bugs that they experience themselves, but I think this bug is really, really serious. It literally crashed a TrueNAS storage server (reboot!) and causes a panic message in Linux. 😢

schreiberstein avatar Aug 05 '22 06:08 schreiberstein

The original bug appears to be #13699, people seem to be piling in various other things.

I am going to bite my tongue and only say that if you find a way to convince OpenZFS leadership that native encryption bugs are worth more than "I hope someone fixes it", please let me know.

rincebrain avatar Aug 05 '22 06:08 rincebrain

The original bug appears to be #13699, people seem to be piling in various other things.

I am going to bite my tongue and only say that if you find a way to convince OpenZFS leadership that native encryption bugs are worth more than "I hope someone fixes it", please let me know.

I get what you are saying... Do you think I could work around it by sending the dataset decrypted instead? I mean, complete an initial sync decrypted, then perform decrypted incremental send? Is it just the encryption that's causing it?

schreiberstein avatar Aug 05 '22 08:08 schreiberstein

I haven't dug into this bug, so I don't know if the flaw in this code is specific to encryption, but it doesn't seem like trying it can make things worse...

rincebrain avatar Aug 05 '22 09:08 rincebrain

I haven't dug into this bug, so I don't know if the flaw in this code is specific to encryption, but it doesn't seem like trying it can make things worse...

I sent an unencrypted (that is, not raw (-w)) version of the dataset and can confirm that sending incremental snapshots to it works flawlessly... Given that I have sent the dataset raw and encrypted twice and I was unable to send incremental snapshots, I conclude the issue lies with the encryption / sending "raw".
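
For clarity, the only difference on the sending side is the -w flag (all names below are placeholders):

# raw incremental send (encrypted blocks shipped as-is) -- the receive panics
zfs send -w -i pool/ds@snap1 pool/ds@snap2 | ssh dest zfs receive destpool/ds
# non-raw incremental send (the sender decrypts; the destination stores it under its own settings) -- works
zfs send -i pool/ds@snap1 pool/ds@snap2 | ssh dest zfs receive destpool/ds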

schreiberstein avatar Aug 15 '22 06:08 schreiberstein

I haven't dug into this bug, so I don't know if the flaw in this code is specific to encryption, but it doesn't seem like trying it can make things worse...

I sent an unencrypted (that is, not raw (-w)) version of the dataset and can confirm that sending incremental snapshots to it works flawlessly... Given that I have sent the dataset raw and encrypted twice and I was unable to send incremental snapshots, I conclude the issue lies with the encryption / sending "raw".

Actually it’s an issue with receiving encrypted streams. The send function works as expected.

If you downgraded the receiver to 2.0.2 it would probably receive just fine.

cyberpower678 avatar Aug 15 '22 06:08 cyberpower678

The original bug appears to be #13699, people seem to be piling in various other things. …

Tentatively (without linking): might https://github.com/openzfs/zfs/pull/14119 partially fix this issue #13445?

Deny receiving into encrypted datasets if the keys are not loaded …

Postscript: hidden as off-topic, given the response from @AttilaFueloep (thank you).

grahamperrin avatar Nov 01 '22 23:11 grahamperrin

Tentatively (without linking): might https://github.com/openzfs/zfs/pull/14119 partially fix this issue #13445?

I don't think so. #14119 fixes an issue where receiving non-raw, unencrypted incremental sends into an encrypted target without the key loaded was accidentally allowed, causing havoc down the stack.

AttilaFueloep avatar Nov 02 '22 00:11 AttilaFueloep

I ran into this issue on FreeBSD 12.2-RELEASE-p14 with zfs-2.0.7-1 (TrueNAS-12.0-U8.1). The source system does a raw send (i.e. zfs send -Rw) as an incremental update with intermediary snapshots (-I). Properties recordsize=128K, compression=lz4, encryption=aes-256-gcm are set on the source dataset, so it sends raw compressed and encrypted blocks to the target system. The target system does not have the encryption key in this case, it should just store the encrypted blocks and never mount it. The target system (running the same release) panics during zfs receive:

panic: VERIFY3(0 == dmu_object_set_blocksize(rwa->os, drro->drr_object, drro->drr_blksz, drro->drr_indblkshift, tx)) failed (0 == 45)

cpuid = 7
time = 1654421418
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe017d8ae880
vpanic() at vpanic+0x17b/frame 0xfffffe017d8ae8d0
spl_panic() at spl_panic+0x3a/frame 0xfffffe017d8ae930
receive_object() at receive_object+0x99f/frame 0xfffffe017d8aea40
receive_writer_thread() at receive_writer_thread+0x115/frame 0xfffffe017d8aeab0
fork_exit() at fork_exit+0x7e/frame 0xfffffe017d8aeaf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe017d8aeaf0
--- trap 0xc, rip = 0x800876caa, rsp = 0x7fffffff7508, rbp = 0x7fffffff7570 ---
KDB: enter: panic
panic.txt: VERIFY3(0 == dmu_object_set_blocksize(rwa->os, drro->drr_object, drro->drr_blksz, drro->drr_indblkshift, tx)) failed (0 == 45)
version.txt: FreeBSD 12.2-RELEASE-p11 75566f060d4(HEAD) TRUENAS

The value returned by dmu_object_set_blocksize() on FreeBSD is EOPNOTSUPP:

     45 EOPNOTSUPP   Operation not supported. The attempted operation is not supported for the type of object referenced. Usually this occurs when a file descriptor refers to a file or socket that cannot support this operation, for example, trying to accept a connection on a datagram socket.

On Debian, it's 95 (consistent with the panic in the original description of this issue).

#define EOPNOTSUPP      95      /* Operation not supported on transport endpoint */

It's hitting this VERIFY statement, which is only called when raw send is enabled:
https://github.com/openzfs/zfs/blob/b9d98453f9387c413f91d1d9cdb0cba8e04dbd95/module/zfs/dmu_recv.c#L1783

EOPNOTSUPP appears to be coming from dnode_set_blksz:
https://github.com/openzfs/zfs/blob/b9d98453f9387c413f91d1d9cdb0cba8e04dbd95/module/zfs/dnode.c#L1814

I'll post more details if I can narrow it down to a simple test case; I don't know yet exactly how to reproduce it in isolation.
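
For reference, the replication described above is roughly of this shape (pool, dataset, and snapshot names here are placeholders, and the exact TrueNAS replication invocation differs); the destination never has the key loaded:

zfs send -Rw -I tank/data@older tank/data@newest | ssh target zfs receive backup/data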

dobsonj avatar Nov 12 '22 00:11 dobsonj

Hi, I just wanted to report that I think I'm having the same issue. Has there been any progress in tackling it? I'm having trouble figuring out how I'm going to make a backup of my main pool, since replication always fails. I can post the dmesg log I get: dmesg_zfs.txt

TBhatti90 avatar Jan 25 '23 02:01 TBhatti90

Hi, I just wanted to report that I think I'm having the same issue. Has there been any progress in tackling it? I'm having trouble figuring out how I'm going to make a backup of my main pool, since replication always fails. I can post the dmesg log I get: dmesg_zfs.txt

Are you sending an incremental snapshot, or the initial snapshot?

If the former, identify which files were changed since the last snapshot, copy them out of the dataset/pool, delete them from the pool, put them back, delete the newest snapshot, and make a new one. Replication should work now.

If the latter, create a new dataset to replace the one failing to replicate and copy all of the files from the old dataset to the new one. Delete the old dataset once the data is copied.
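
For the latter option, a rough sketch (all names are placeholders; verify the copy before destroying anything):

zfs create pool/dataset-new
rsync -a /mnt/pool/dataset/ /mnt/pool/dataset-new/
zfs destroy -r pool/dataset              # only after the copy has been verified
zfs rename pool/dataset-new pool/dataset # optional, to keep the old name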

cyberpower678 avatar Jan 25 '23 02:01 cyberpower678

Hold on, I'm confused about your former option. That's the problem. It's an incremental snapshot, a series of them actually. "I copy the diff files from the newest snapshot, delete them from the newest snapshot, put them back in the newest snapshot, delete the newest snapshot, make a new snapshot". This is weird. Why is it like this? This isn't how sending and replication is supposed to behave. What's going on? I'm using sanoid for my snapshot management, so it creates automatic snapshots at a set time frame, like daily, or monthly. Should I delete every snapshot, except the initial one, that's the same on the server and the backup?

TBhatti90 avatar Jan 26 '23 07:01 TBhatti90

Hold on, I'm confused about your former option. That's the problem. It's an incremental snapshot, a series of them actually. "I copy the diff files from the newest snapshot, delete them from the newest snapshot, put them back in the newest snapshot, delete the newest snapshot, make a new snapshot". This is weird. Why is it like this? This isn't how sending and replication is supposed to behave. What's going on? I'm using sanoid for my snapshot management, so it creates automatic snapshots at a set time frame, like daily, or monthly. Should I delete every snapshot, except the initial one, that's the same on the server and the backup?

I’m venturing a guess your affected dataset is an older dataset from an older ZFS version. On newer versions of ZFS, some dataset metadata left behind by older versions don’t seem to play nice with the new versions upon sending and receiving them. The idea here is to go back as many snapshots as needed, to get to the point where the last snapshot sent was still working on the backup. Make a note of every file that’s been modified, or created on the dataset since the last working snapshot, and copy them out of the pool. Roll back the source dataset to the last snapshot that sent successfully to the backup. Do the same for the destination. Delete all the files on source that you copied out, that still existed at that point. Copy files back into the dataset, and create a new snapshot. Sending the snapshot should be successful this time around. I’ve discovered this to be an issue that eventually goes away on its own. Just deal with it a couple of times, and it will likely go away for you too.

cyberpower678 avatar Jan 30 '23 03:01 cyberpower678

I’m venturing a guess your affected dataset is an older dataset from an older ZFS version. On newer versions of ZFS, some dataset metadata left behind by older versions don’t seem to play nice with the new versions upon sending and receiving them. The idea here is to go back as many snapshots as needed, to get to the point where the last snapshot sent was still working on the backup. Make a note of every file that’s been modified, or created on the dataset since the last working snapshot, and copy them out of the pool. Roll back the source dataset to the last snapshot that sent successfully to the backup. Do the same for the destination. Delete all the files on source that you copied out, that still existed at that point. Copy files back into the dataset, and create a new snapshot. Sending the snapshot should be successful this time around. I’ve discovered this to be an issue that eventually goes away on its own. Just deal with it a couple of times, and it will likely go away for you too.

So is there a way to detect which version of zfs the snapshot is from? Now, I don't mind losing the old snapshots. I'm ok with only having relatively recent snapshots. But I'm on Debian Bullseye, so my zfs version shouldn't have changed. Think it's 2.0.3.

The pool itself appears to be from December 2021 on TrueNAS CORE (FreeBSD). I'm not sure what version of ZFS they were using back then.

If it helps, here's my syncoid script

#!/bin/sh
FILESYSTEMS="Data"
LOG=/var/log/syncoid.log

for i in $FILESYSTEMS; do
    echo "--------------- `date` - syncing $i -------------" >> $LOG 2>&1
    /usr/sbin/syncoid -r --exclude=Data/Encrypted/Media --exclude=Data/Unencrypted/Media/Collections --exclude="Data/Unencrypted/Media/TV Shows" --exclude=Data/Unencrypted/Media/Movies --exclude=Data/Unencrypted/Media/Music --exclude=Data/Unencrypted/Media/Temp --exclude=Data/Encrypted/Users/USER/Backups/Desktop --exclude=Data/Encrypted/Users/USER/Backups/Laptop --sendoptions="wp" --skip-parent --no-sync-snap $i root@backup:Backup/$i >> $LOG 2>&1
    echo "--------------- `date` - done $i -------------" >> $LOG 2>&1
done

TBhatti90 avatar Jan 30 '23 05:01 TBhatti90

I don't really think your description of the problem, or how to work around it, is accurate, @cyberpower678.

More precisely, if someone who can reproduce this reliably wants to add a few debug prints the next time this happens to them, it's probably pretty straightforward to see why it's getting upset? I have a couple of wild guesses, but without a reproducer I can take home with me, I'm going to get to ask whoever to try things and see how it breaks.

e: reading back, no, this is just the same known issue, probably. Go ask Debian to cherrypick #13782 if you want it to work, I think, or run 2.1.7 or newer.

rincebrain avatar Jan 30 '23 06:01 rincebrain

I don't really think your description of the problem, or how to work around it, is accurate, @cyberpower678.

More precisely, if someone who can reproduce this reliably wants to add a few debug prints the next time this happens to them, it's probably pretty straightforward to see why it's getting upset? I have a couple of wild guesses, but without a reproducer I can take home with me, I'm going to get to ask whoever to try things and see how it breaks.

e: reading back, no, this is just the same known issue, probably. Go ask Debian to cherrypick #13782 if you want it to work, I think, or run 2.1.7 or newer.

It's accurate for me. When I did this, whatever was causing zfs receive to either corrupt the dataset or throw a panic was gone, and the subsequent receive was successful. After 3 more instances of this happening, and doing my tedious workaround, it stopped happening entirely. It's been almost a year since the last occurrence.

cyberpower678 avatar Jan 30 '23 16:01 cyberpower678

I’m venturing a guess your affected dataset is an older dataset from an older ZFS version. On newer versions of ZFS, some dataset metadata left behind by older versions don’t seem to play nice with the new versions upon sending and receiving them. The idea here is to go back as many snapshots as needed, to get to the point where the last snapshot sent was still working on the backup. Make a note of every file that’s been modified, or created on the dataset since the last working snapshot, and copy them out of the pool. Roll back the source dataset to the last snapshot that sent successfully to the backup. Do the same for the destination. Delete all the files on source that you copied out, that still existed at that point. Copy files back into the dataset, and create a new snapshot. Sending the snapshot should be successful this time around. I’ve discovered this to be an issue that eventually goes away on its own. Just deal with it a couple of times, and it will likely go away for you too.

So is there a way to detect which version of zfs the snapshot is from? Now, I don't mind losing the old snapshots. I'm ok with only having relatively recent snapshots. But I'm on Debian Bullseye, so my zfs version shouldn't have changed. Think it's 2.0.3.

The pool itself appears to be from December 2021 on TrueNAS CORE (FreeBSD). I'm not sure what version of ZFS they were using back then.

If it helps, here's my syncoid script

#!/bin/sh
FILESYSTEMS="Data"
LOG=/var/log/syncoid.log

for i in $FILESYSTEMS; do
    echo "--------------- `date` - syncing $i -------------" >> $LOG 2>&1
    /usr/sbin/syncoid -r --exclude=Data/Encrypted/Media --exclude=Data/Unencrypted/Media/Collections --exclude="Data/Unencrypted/Media/TV Shows" --exclude=Data/Unencrypted/Media/Movies --exclude=Data/Unencrypted/Media/Music --exclude=Data/Unencrypted/Media/Temp --exclude=Data/Encrypted/Users/USER/Backups/Desktop --exclude=Data/Encrypted/Users/USER/Backups/Laptop --sendoptions="wp" --skip-parent --no-sync-snap $i root@backup:Backup/$i >> $LOG 2>&1
    echo "--------------- `date` - done $i -------------" >> $LOG 2>&1
done

@rincebrain may have a point that I may not be on point here. I'm laying out what worked for me. The general idea is that if older snapshots made it to the destination just fine, you keep those. Only delete the snapshots that are failing to send/receive successfully. Identify what has changed since the last successful snapshot, roll back to the last working snapshot after copying the changed/added files out, and then completely replace said files (delete and copy back) in the dataset, and create a new snapshot based on the last successful one. Try sending that to the destination. If it still panics, then the only option for now is to simply create a new dataset on the source, move the files to the new dataset, delete the old one, and send the whole thing again. Really annoying, but most effective in the long run, in my experience.

cyberpower678 avatar Jan 30 '23 16:01 cyberpower678

Or you could just apply the patch that fixes the error on the receiver.

rincebrain avatar Jan 30 '23 17:01 rincebrain

Or you could just apply the patch that fixes the error on the receiver.

That would be a great solution. Too bad I'm not familiar with how ZFS works internally to actually fix it myself. :-( So, I just work around issues until they are fixed, or are no longer issues.

cyberpower678 avatar Jan 30 '23 17:01 cyberpower678

I wasn't suggesting you write a fix. I was suggesting you use 2.1.7 or newer, which have this fixed, or get your distro du jour to apply the relevant fix.

rincebrain avatar Jan 30 '23 17:01 rincebrain

Hey so just want to make sure I'm understanding correctly, 2.1.7 fixes this right? Debian has that in the backports, along with a newer kernel. I can check that out. Would I have to upgrade my pool too?

If there's any way I can get some debug print statements for you guys, I can provide that. The error seems pretty consistent. Always with the one dataset.

TBhatti90 avatar Jan 30 '23 18:01 TBhatti90

Or you could just apply the patch that fixes the error on the receiver.

That would be a great solution. Too bad I'm not familiar with how ZFS works internally to actually fix it myself. So, I just work around issues until they are fixed, or are no longer issues.

I wasn't suggesting you write a fix. I was suggesting you use 2.1.7 or newer, which have this fixed, or get your distro du jour to apply the relevant fix.

Oh. Whoops. Now that I reread your initial comment, that does make more sense. lol That being said, I haven't dealt with this issue since I initially opened it here. My encrypted incrementals have been replicating without issue.

cyberpower678 avatar Jan 30 '23 19:01 cyberpower678

You don't need to pull a newer kernel to use the newer version of ZFS, you can just pull the zfs packages from backports.

You also shouldn't need to run "zpool upgrade" or otherwise tweak any feature flags.
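
On Bullseye that's roughly the following (package names as in Debian's zfs-linux packaging; adjust to taste):

# with bullseye-backports enabled in your apt sources:
apt install -t bullseye-backports zfs-dkms zfsutils-linux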

My understanding is that redoing the snapshot would essentially just reroll the dice on whether this triggered, because if I understand the bug correctly, in certain cases it wasn't recording that it used large blocks before attempting to create a large block, so the point at which specific blocks got included in the send stream (plus variation in your local system) could affect whether this broke for you or not. So I would speculate it's conceivably possible for someone to write a zstream command that reordered when various things appear in the send stream so this didn't come up? I'm just spitballing, though.

rincebrain avatar Jan 30 '23 19:01 rincebrain

Hey so just want to make sure I'm understanding correctly, 2.1.7 fixes this right? Debian has that in the backports, along with a newer kernel. I can check that out. Would I have to upgrade my pool too?

If there's any way I can get some debug print statements for you guys, I can provide that. The error seems pretty consistent. Always with the one dataset.

Pool upgrades shouldn't be needed, but it shouldn't hurt. Though I'm not sure about upgrading ZFS in your particular installation since it's TrueNAS.

The latest TrueNAS SCALE is on 2.1.6

zfs --version

zfs-2.1.6-1 zfs-kmod-2.1.6-1

cyberpower678 avatar Jan 30 '23 19:01 cyberpower678

Just wanted to report that the same issue happened with 2.1.7 on Debian with the 6.0.0 kernel. This is the same on the source and destination. The source and destination have an initial snapshot from 12/21/21. The source has many snapshots after that, as it's the primary server. The source was sending a snapshot from 1/30/23, and that's when the kernel panic happened on the destination.

So you want me to delete the new snapshots and paste the changed files in the old snapshot? That sounds difficult.

[23638.869168] VERIFY3(0 == dmu_object_set_blocksize(rwa->os, drro->drr_object, drro->drr_blksz, drro->drr_indblkshift, tx)) failed (0 == 95)
[23638.869231] PANIC at dmu_recv.c:1784:receive_object()
[23638.869255] Showing stack for process 21420
[23638.869256] CPU: 0 PID: 21420 Comm: receive_writer Tainted: P           OE     6.0.0-0.deb11.6-amd64 #1 Debian 6.0.12-1~bpo11+1
[23638.869259] Hardware name: ASUS All Series/Z97I-PLUS, BIOS 2704 02/18/2016
[23638.869260] Call Trace:
[23638.869261] <TASK>
[23638.869263]  dump_stack_lvl+0x45/0x5e
[23638.869270]  spl_panic+0xd1/0xe9 [spl]
[23638.869279]  ? _raw_spin_unlock+0x15/0x30
[23638.869282]  ? dnode_hold_impl+0x4eb/0xee0 [zfs]
[23638.869355]  ? preempt_count_add+0x70/0xa0
[23638.869358]  ? preempt_count_add+0x70/0xa0
[23638.869360]  ? _raw_spin_lock+0x13/0x40
[23638.869362]  ? dnode_rele_and_unlock+0x59/0xf0 [zfs]
[23638.869406]  ? _raw_spin_unlock+0x15/0x30
[23638.869408]  receive_object+0xc30/0xca0 [zfs]
[23638.869452]  ? dnode_rele_and_unlock+0x59/0xf0 [zfs]
[23638.869515]  ? preempt_count_add+0x70/0xa0
[23638.869517]  receive_writer_thread+0x1cc/0xad0 [zfs]
[23638.869559]  ? set_next_task_fair+0x2d/0xd0
[23638.869563]  ? _raw_spin_unlock_irqrestore+0x23/0x40
[23638.869565]  ? receive_process_write_record+0x1a0/0x1a0 [zfs]
[23638.869606]  ? __thread_exit+0x20/0x20 [spl]
[23638.869621]  ? thread_generic_wrapper+0x59/0x70 [spl]
[23638.869626]  thread_generic_wrapper+0x59/0x70 [spl]
[23638.869631]  kthread+0xe7/0x110
[23638.869633]  ? kthread_complete_and_exit+0x20/0x20
[23638.869635]  ret_from_fork+0x22/0x30
[23638.869639] </TASK>

TBhatti90 avatar Jan 31 '23 03:01 TBhatti90

Neat. Could you share precisely what zfs version reports on the receiver? I want to go see what's in your precise version...

e: My next suggestion if 2.1.7 didn't resolve it for you would be to try 2.1.9, since 2.1.8 includes f806306ce07e0b0bfe994159ec1c75a2c3281d88 which further revised the code that c8d2ab05e1e6bf21185723fabdebbe5bb3374381 touched.

rincebrain avatar Jan 31 '23 03:01 rincebrain