Panic on zfs receive
System information
Type | Version/Name
--- | ---
Distribution Name | FreeBSD
Distribution Version | 13.1-RELEASE
Kernel Version | 13.1-RELEASE
Architecture | amd64
OpenZFS Version | zfs-kmod-2.1.4-FreeBSD_g52bad4f23
Describe the problem you're observing
Kernel panic shortly after starting zrepl job.
Describe how to reproduce the problem
The system has unencrypted filesystems on a RAIDZ backing to an encrypted FS on a single USB disk (a rotating offsite backup) using zrepl. The power went out on the USB enclosure and the FS wedged so I ended up rebooting the system. After that subsequent attempts to send result in the following panic.
Include any warning/errors/backtraces from the system logs
Kernel panic:
panic: VERIFY3(0 == zap_add(mos, dsl_dir_phys(pds)->dd_child_dir_zapobj, name, sizeof (uint64_t), 1, &ddobj, tx)) failed (0 == 17)
cpuid = 5
time = 1656330570
KDB: stack backtrace:
#0 0xffffffff80c69465 at kdb_backtrace+0x65
#1 0xffffffff80c1bb1f at vpanic+0x17f
#2 0xffffffff8219bf4a at spl_panic+0x3a
#3 0xffffffff82235236 at dsl_dir_create_sync+0x1e6
#4 0xffffffff8222423e at dsl_dataset_create_sync+0x3e
#5 0xffffffff8220a993 at dmu_recv_begin_sync+0x183
#6 0xffffffff8224f704 at dsl_sync_task_sync+0xb4
#7 0xffffffff822422fb at dsl_pool_sync+0x42b
#8 0xffffffff822782d0 at spa_sync+0xade
#9 0xffffffff8228cc47 at txg_sync_thread+0x237
#10 0xffffffff80bd8a5e at fork_exit+0x7e
#11 0xffffffff8108859e at fork_trampoline+0xe
I tried the OpenZFS port (version v2022021400-zfs_9f734e81f) and it happens with that version as well.
Function offsets are slightly different but the function stack is the same.
What happened?
Oops, I managed to close the bug by mistake.
I have another backup HD and I reproduced it by (accidentally) exporting the pool while the backup was running - the next (and subsequent) runs cause the panic.
I observed the same error (below) on the following configuration. The error happened immediately following the SSH connection establishing the receive (incremental backup from TrueNAS-SCALE-22.02.2). Following the panicked txg_sync, further zfs operations hang.
OS: Linux (NixOS 21.11)
Arch: aarch64 (arm64)
Kernel Version: 5.17.3
zfs --version: zfs-2.1.4-1 zfs-kmod-2.1.4-1
send-stream source (TrueNAS) zfs --version: zfs-2.1.4-1 zfs-kmod-2.1.4-1
Pool configuration: a single USB HDD in the pool. The issue is not the drive being spun down: I both ran zpool status and touched a file on the pool (waiting a good 10 seconds for the txg to sync), and both completed successfully.
VERIFY3(0 == zap_add(mos, dsl_dir_phys(pds)->dd_child_dir_zapobj, name, sizeof (uint64_t), 1, &ddobj, tx)) failed (0 == 17)
PANIC at dsl_dir.c:951:dsl_dir_create_sync()
Showing stack for process 869
CPU: 2 PID: 869 Comm: txg_sync Tainted: P C O 5.17.3 #1-NixOS
Hardware name: Raspberry Pi 3 Model B (DT)
Call trace:
dump_backtrace+0x114/0x120
show_stack+0x20/0x68
dump_stack_lvl+0x68/0x84
dump_stack+0x18/0x34
spl_dumpstack+0x2c/0x38 [spl]
spl_panic+0xdc/0xf4 [spl]
dsl_dir_create_sync+0x228/0x2a0 [zfs]
dsl_dataset_create_sync+0x70/0x328 [zfs]
dmu_recv_begin_sync+0x2f8/0x888 [zfs]
dsl_sync_task_sync+0xc8/0x130 [zfs]
dsl_pool_sync+0x3e8/0x4f8 [zfs]
spa_sync+0x518/0xf00 [zfs]
txg_sync_thread+0x2ac/0x450 [zfs]
thread_generic_wrapper+0x68/0x80 [spl]
kthread+0xe4/0xf0
ret_from_fork+0x10/0x20
Following this, the txg_sync that failed gets stuck with the following error repeatedly showing up in the logs
INFO: task txg_sync:869 blocked for more than 362 seconds.
Tainted: P C O 5.17.3 #1-NixOS
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:txg_sync state:D stack: 0 pid: 869 ppid: 2 flags:0x00000208
Call trace:
__switch_to+0x10c/0x1d0
__schedule+0x824/0xf58
schedule+0x50/0xe0
spl_panic+0xf0/0xf4 [spl]
dsl_dir_create_sync+0x228/0x2a0 [zfs]
dsl_dataset_create_sync+0x70/0x328 [zfs]
dmu_recv_begin_sync+0x2f8/0x888 [zfs]
dsl_sync_task_sync+0xc8/0x130 [zfs]
dsl_pool_sync+0x3e8/0x4f8 [zfs]
spa_sync+0x518/0xf00 [zfs]
txg_sync_thread+0x2ac/0x450 [zfs]
thread_generic_wrapper+0x68/0x80 [spl]
kthread+0xe4/0xf0
ret_from_fork+0x10/0x20
Also happened with zfs 2.1.5.
VERIFY3(0 == zap_add(mos, dsl_dir_phys(pds)->dd_child_dir_zapobj, name, sizeof (uint64_t), 1, &ddobj, tx)) failed (0 == 17)
PANIC at dsl_dir.c:951:dsl_dir_create_sync()
Showing stack for process 3064
CPU: 3 PID: 3064 Comm: txg_sync Tainted: P OE 5.18.0-0.bpo.1-amd64 #1 Debian 5.18.2-1~bpo11+1
Hardware name: GIGABYTE MX33-BS1-V1/MX33-BS1-V1, BIOS F04a 10/01/2021
Call Trace:
<TASK>
dump_stack_lvl+0x45/0x5a
spl_panic+0xd1/0xe9 [spl]
dsl_dir_create_sync+0x21a/0x2a0 [zfs]
dsl_dataset_create_sync+0x50/0x390 [zfs]
dmu_recv_begin_sync+0x397/0xa20 [zfs]
? metaslab_class_get_dspace+0x12/0x30 [zfs]
dsl_sync_task_sync+0xa3/0xf0 [zfs]
dsl_pool_sync+0x40b/0x520 [zfs]
spa_sync+0x540/0xfa0 [zfs]
? _raw_spin_lock+0x13/0x30
? spa_txg_history_init_io+0x101/0x110 [zfs]
txg_sync_thread+0x2dc/0x490 [zfs]
? txg_fini+0x250/0x250 [zfs]
? __thread_exit+0x20/0x20 [spl]
thread_generic_wrapper+0x56/0x60 [spl]
kthread+0xe5/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>
This seems to happen when receiving a non-encrypted snapshot into an encrypted dataset whose key has not been loaded.
I can confirm this issue.
OS: Linux (Armbian 22.05.3 Bullseye)
Arch: aarch64 (arm64)
Kernel Version: 5.10.123-meson64
zfs: zfs-2.1.5-1~bpo11+1
zfs-kmod-2.1.5-1~bpo11+1
Jul 24 23:20:46 odroidhc4 kernel: VERIFY3(0 == zap_add(mos, dsl_dir_phys(pds)->dd_child_dir_zapobj, name, sizeof (uint64_t)>
Jul 24 23:20:46 odroidhc4 kernel: PANIC at dsl_dir.c:951:dsl_dir_create_sync()
Jul 24 23:20:46 odroidhc4 kernel: Showing stack for process 2299
Jul 24 23:20:46 odroidhc4 kernel: CPU: 3 PID: 2299 Comm: txg_sync Tainted: P C OE 5.10.123-meson64 #22.05.3
Jul 24 23:20:46 odroidhc4 kernel: Hardware name: Hardkernel ODROID-HC4 (DT)
Jul 24 23:20:46 odroidhc4 kernel: Call trace:
Jul 24 23:20:46 odroidhc4 kernel: dump_backtrace+0x0/0x200
Jul 24 23:20:46 odroidhc4 kernel: show_stack+0x18/0x68
Jul 24 23:20:46 odroidhc4 kernel: dump_stack+0xd8/0x134
Jul 24 23:20:46 odroidhc4 kernel: spl_dumpstack+0x24/0x30 [spl]
Jul 24 23:20:46 odroidhc4 kernel: spl_panic+0xd0/0xe8 [spl]
Jul 24 23:20:46 odroidhc4 kernel: dsl_dir_create_sync+0x220/0x2a0 [zfs]
Jul 24 23:20:46 odroidhc4 kernel: dsl_dataset_create_sync+0x64/0x320 [zfs]
Jul 24 23:20:46 odroidhc4 kernel: dmu_recv_begin_sync+0x2ec/0x874 [zfs]
Jul 24 23:20:46 odroidhc4 kernel: dsl_sync_task_sync+0xc0/0x12c [zfs]
Jul 24 23:20:46 odroidhc4 kernel: dsl_pool_sync+0x388/0x494 [zfs]
Jul 24 23:20:46 odroidhc4 kernel: spa_sync+0x4e8/0xe24 [zfs]
Jul 24 23:20:46 odroidhc4 kernel: txg_sync_thread+0x2ac/0x410 [zfs]
Jul 24 23:20:46 odroidhc4 kernel: thread_generic_wrapper+0x60/0x80 [spl]
Jul 24 23:20:46 odroidhc4 kernel: kthread+0x154/0x158
Jul 24 23:20:46 odroidhc4 kernel: ret_from_fork+0x10/0x30
I can also confirm that it happens when receiving unencrypted to encrypted with unloaded keys.
Btw, thanks @siilike for the hint. I would never have realized the keys were unloaded (due to a bug), because I never mount that pool and all the usual zfs commands also work with unloaded keys. I was blaming myself for the kernel panic, because I had manually constructed a missing kernel header file needed to compile zfs as a kernel module.
17 is EEXIST. It appears that dmu_object_alloc is generating an objset ID that has already been recorded by a previous zap_add, presumably right before the interruption of the previous send attempt.
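To make that failure mode concrete, here is a toy Python model (not ZFS code — the dict merely stands in for the child-dir ZAP) of how a stale entry turns zap_add into EEXIST and trips the VERIFY3:

```python
import errno

# errno 17 is EEXIST, matching the "(0 == 17)" in the panic message.
assert errno.EEXIST == 17

def zap_add(zap, name, obj):
    """Toy stand-in for zap_add(): refuses to overwrite an existing entry."""
    if name in zap:
        return errno.EEXIST
    zap[name] = obj
    return 0

child_dir_zapobj = {}

# The first (later interrupted) receive records the child directory entry...
assert zap_add(child_dir_zapobj, "backup", 1234) == 0

# ...so a retried receive that ends up reusing the same name gets EEXIST
# back, which VERIFY3(0 == zap_add(...)) turns into a panic.
err = zap_add(child_dir_zapobj, "backup", 1234)
assert err == errno.EEXIST
```

In the kernel the same condition is fatal because VERIFY3 has no error path: any nonzero return from zap_add panics the txg_sync thread mid-transaction, which is why all later zfs/zpool commands hang.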
I also have run into this issue while transferring an encrypted dataset (with --raw
on the send) to a machine without keys loaded.
Version Info:
root@cau:~# zfs --version
zfs-2.1.5-pve1
zfs-kmod-2.1.5-pve1
root@cau:~# uname -a
Linux cau 5.15.39-4-pve #1 SMP PVE 5.15.39-4 (Mon, 08 Aug 2022 15:11:15 +0200) x86_64 GNU/Linux
Seeing this on Ubuntu as well, same scenario of sending a dataset without keys loaded. ZFS commands hang after the panic with the task zpool:1234 blocked for more than 120 seconds
message, requiring a reboot.
~ $ zfs --version
zfs-0.8.3-1ubuntu12.13
zfs-kmod-2.1.4-0ubuntu0.1
~ $ uname -a
Linux hostname 5.15.0-46-generic #49~20.04.1-Ubuntu SMP Thu Aug 4 19:15:44 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
~ $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.4 LTS"
I can confirm that the PANIC happens when accidentally sending a non-raw zfs send onto an encrypted dataset with unloaded keys (i.e. an encrypted remote backup).
After that panic, all zpool and zfs commands hang and a complete system restart is necessary.
[39079.327825] VERIFY3(0 == zap_add(mos, dsl_dir_phys(pds)->dd_child_dir_zapobj, name, sizeof (uint64_t), 1, &ddobj, tx)) failed (0 == 17)
[39079.340749] PANIC at dsl_dir.c:951:dsl_dir_create_sync()
[39079.347128] Showing stack for process 766
[39079.347148] CPU: 1 PID: 766 Comm: txg_sync Tainted: P C O 5.15.0-1014-raspi #16-Ubuntu
[39079.347156] Hardware name: Raspberry Pi 4 Model B Rev 1.1 (DT)
[39079.347160] Call trace:
[39079.347162] dump_backtrace+0x0/0x1f0
[39079.347174] show_stack+0x24/0x30
[39079.347178] dump_stack_lvl+0x8c/0xb8
[39079.347188] dump_stack+0x18/0x34
[39079.347194] spl_dumpstack+0x30/0x3c [spl]
[39079.347242] spl_panic+0xdc/0xf4 [spl]
[39079.347270] dsl_dir_create_sync+0x21c/0x29c [zfs]
[39079.347765] dsl_dataset_create_sync+0x70/0x324 [zfs]
[39079.348059] dmu_recv_begin_sync+0x304/0x890 [zfs]
[39079.348347] dsl_sync_task_sync+0xcc/0x140 [zfs]
[39079.348638] dsl_pool_sync+0x348/0x3dc [zfs]
[39079.348925] spa_sync_iterate_to_convergence+0xf4/0x20c [zfs]
[39079.349220] spa_sync+0x2c0/0x534 [zfs]
[39079.349512] txg_sync_thread+0x258/0x2c0 [zfs]
[39079.349761] thread_generic_wrapper+0x6c/0x84 [spl]
[39079.349802] kthread+0x12c/0x140
[39079.349810] ret_from_fork+0x10/0x20
root@player /home/peter $ zfs --version
zfs-2.1.4-0ubuntu0.1
zfs-kmod-2.1.4-0ubuntu0.1
root@player /home/peter $ uname -a
Linux player.lan 5.15.0-1014-raspi #16-Ubuntu SMP PREEMPT Thu Aug 25 09:50:55 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux
root@player /home/peter $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"
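Based on the trigger described in the comments above, a minimal reproduction sketch on a throwaway, file-backed test pool might look like the following. Pool and dataset names are illustrative, and this is expected to panic the kernel on affected versions, so only try it in a disposable VM:

```shell
# DANGER: expected to panic the kernel -- disposable VM only.
truncate -s 512M /tmp/tank.img
zpool create testpool /tmp/tank.img

# Encrypted receive target; the key is loaded at creation time.
echo 'passphrase12345' | zfs create -o encryption=on \
    -o keyformat=passphrase testpool/enc

# Unencrypted source with a snapshot to send.
zfs create testpool/src
zfs snapshot testpool/src@s1

# Unload the key, as in the reports above.
zfs unmount testpool/enc
zfs unload-key testpool/enc

# Non-raw (no -w) send into the locked encrypted dataset: on affected
# versions this trips the VERIFY3 in dsl_dir_create_sync().
zfs send testpool/src@s1 | zfs receive testpool/enc/backup
```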
Same for me: the PANIC happens when accidentally sending a non-raw zfs send onto an encrypted dataset with unloaded keys. (Shouldn't -w be implied in this situation, or an error be given?)
zfs --version zfs-2.1.5-pve1 zfs-kmod-2.1.5-pve1
Sep 23 08:41:59 pve kernel: [ 3734.014157] VERIFY3(0 == zap_add(mos, dsl_dir_phys(pds)->dd_child_dir_zapobj, name, sizeof (uint64_t), 1, &ddobj, tx)) failed (0 == 17)
Sep 23 08:41:59 pve kernel: [ 3734.014162] PANIC at dsl_dir.c:951:dsl_dir_create_sync()
Sep 23 08:41:59 pve kernel: [ 3734.014165] Showing stack for process 894
Sep 23 08:41:59 pve kernel: [ 3734.014166] CPU: 1 PID: 894 Comm: txg_sync Tainted: P O 5.15.53-1-pve #1
Sep 23 08:41:59 pve kernel: [ 3734.014182] Hardware name: FUJITSU /D3644-B1, BIOS V5.0.0.13 R1.26.0 for D3644-B1x 03/08/2021
Sep 23 08:41:59 pve kernel: [ 3734.014185] Call Trace:
Sep 23 08:41:59 pve kernel: [ 3734.014186] <TASK>
Sep 23 08:41:59 pve kernel: [ 3734.014189] dump_stack_lvl+0x4a/0x63
Sep 23 08:41:59 pve kernel: [ 3734.014194] dump_stack+0x10/0x16
Sep 23 08:41:59 pve kernel: [ 3734.014197] spl_dumpstack+0x29/0x2f [spl]
Sep 23 08:41:59 pve kernel: [ 3734.014204] spl_panic+0xd1/0xe9 [spl]
Sep 23 08:41:59 pve kernel: [ 3734.014211] ? zap_add+0xbe/0xf0 [zfs]
Sep 23 08:41:59 pve kernel: [ 3734.014287] dsl_dir_create_sync+0x212/0x290 [zfs]
Sep 23 08:41:59 pve kernel: [ 3734.014335] dsl_dataset_create_sync+0x52/0x380 [zfs]
Sep 23 08:41:59 pve kernel: [ 3734.014381] dmu_recv_begin_sync+0x3b2/0xa70 [zfs]
Sep 23 08:41:59 pve kernel: [ 3734.014421] ? spa_get_slop_space+0x5d/0xb0 [zfs]
Sep 23 08:41:59 pve kernel: [ 3734.014478] dsl_sync_task_sync+0xb7/0x110 [zfs]
Sep 23 08:41:59 pve kernel: [ 3734.014528] dsl_pool_sync+0x3dd/0x4f0 [zfs]
Sep 23 08:41:59 pve kernel: [ 3734.014576] spa_sync+0x55a/0x1020 [zfs]
Sep 23 08:41:59 pve kernel: [ 3734.014631] ? spa_txg_history_init_io+0x10a/0x120 [zfs]
Sep 23 08:41:59 pve kernel: [ 3734.014688] txg_sync_thread+0x2d3/0x460 [zfs]
Sep 23 08:41:59 pve kernel: [ 3734.014746] ? txg_init+0x2c0/0x2c0 [zfs]
Sep 23 08:41:59 pve kernel: [ 3734.014803] thread_generic_wrapper+0x61/0x80 [spl]
Sep 23 08:41:59 pve kernel: [ 3734.014810] ? __thread_exit+0x20/0x20 [spl]
Sep 23 08:41:59 pve kernel: [ 3734.014815] kthread+0x127/0x150
Sep 23 08:41:59 pve kernel: [ 3734.014819] ? set_kthread_struct+0x50/0x50
Sep 23 08:41:59 pve kernel: [ 3734.014822] ret_from_fork+0x1f/0x30
Sep 23 08:41:59 pve kernel: [ 3734.014827] </TASK>
Thanks lopiuh
Sorry for not noticing this issue until now. This bug is raised in #14055 as well. Please see there for a fix.
No worries @AttilaFueloep – thank you so much for working on this! Bugs happen, no need to be sorry :)
@MaZderMind Thanks for your kind words, much appreciated!