
Panic on zfs receive

Open DanielO opened this issue 2 years ago • 11 comments

System information

Type Version/Name
Distribution Name FreeBSD
Distribution Version 13.1-RELEASE
Kernel Version 13.1-RELEASE
Architecture amd64
OpenZFS Version zfs-kmod-2.1.4-FreeBSD_g52bad4f23

Describe the problem you're observing

Kernel panic shortly after starting zrepl job.

Describe how to reproduce the problem

The system has unencrypted filesystems on a RAIDZ backing up to an encrypted FS on a single USB disk (a rotating offsite backup) using zrepl. The power went out on the USB enclosure and the FS wedged, so I ended up rebooting the system. After that, subsequent attempts to send result in the following panic.
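
For reference, a minimal sketch of the setup (pool, disk, and dataset names are hypothetical; zrepl automates the snapshot/send/receive steps shown manually here):

# Source: unencrypted filesystems on a RAIDZ pool.
zpool create tank raidz da1 da2 da3
zfs create tank/data

# Target: an encrypted filesystem on a single USB disk (rotating offsite backup).
zpool create backup da4
zfs create -o encryption=on -o keyformat=passphrase backup/offsite

# zrepl drives the replication; the manual equivalent is roughly:
zfs snapshot tank/data@snap1
zfs send tank/data@snap1 | zfs receive backup/offsite/data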

Include any warning/errors/backtraces from the system logs

Kernel panic:

panic: VERIFY3(0 == zap_add(mos, dsl_dir_phys(pds)->dd_child_dir_zapobj, name, sizeof (uint64_t), 1, &ddobj, tx)) failed (0 == 17)
cpuid = 5
time = 1656330570
KDB: stack backtrace:
#0 0xffffffff80c69465 at kdb_backtrace+0x65
#1 0xffffffff80c1bb1f at vpanic+0x17f
#2 0xffffffff8219bf4a at spl_panic+0x3a
#3 0xffffffff82235236 at dsl_dir_create_sync+0x1e6
#4 0xffffffff8222423e at dsl_dataset_create_sync+0x3e
#5 0xffffffff8220a993 at dmu_recv_begin_sync+0x183
#6 0xffffffff8224f704 at dsl_sync_task_sync+0xb4
#7 0xffffffff822422fb at dsl_pool_sync+0x42b
#8 0xffffffff822782d0 at spa_sync+0xade
#9 0xffffffff8228cc47 at txg_sync_thread+0x237
#10 0xffffffff80bd8a5e at fork_exit+0x7e
#11 0xffffffff8108859e at fork_trampoline+0xe

DanielO avatar Jun 27 '22 23:06 DanielO

I tried the OpenZFS port (version v2022021400-zfs_9f734e81f) and it happens with that version as well.

Function offsets are slightly different but the function stack is the same.

DanielO avatar Jun 28 '22 03:06 DanielO

What happened?

rincebrain avatar Jun 28 '22 04:06 rincebrain

Oops, I managed to close the bug by mistake.

DanielO avatar Jun 28 '22 04:06 DanielO

I have another backup HD and I reproduced it by (accidentally) exporting the pool while the backup was running; the next (and subsequent) runs cause the panic.

DanielO avatar Jun 29 '22 00:06 DanielO

I observed the same error (below) on the following configuration. The error happened immediately after the SSH connection for the receive was established (an incremental backup from TrueNAS-SCALE-22.02.2). After txg_sync panics, further zfs operations hang.

OS: Linux (NixOS 21.11)
Arch: aarch64 (arm64)
Kernel Version: 5.17.3
zfs --version: zfs-2.1.4-1, zfs-kmod-2.1.4-1
send-stream source (TrueNAS) zfs --version: zfs-2.1.4-1, zfs-kmod-2.1.4-1

Pool configuration: a single USB HDD in the pool. The issue is not the drive being spun down: I both ran zpool status and touched a file on the pool (then waited a good 10 seconds for the txg to sync), and both completed successfully.

VERIFY3(0 == zap_add(mos, dsl_dir_phys(pds)->dd_child_dir_zapobj, name, sizeof (uint64_t), 1, &ddobj, tx)) failed (0 == 17)
PANIC at dsl_dir.c:951:dsl_dir_create_sync()
Showing stack for process 869
CPU: 2 PID: 869 Comm: txg_sync Tainted: P         C O      5.17.3 #1-NixOS
Hardware name: Raspberry Pi 3 Model B (DT)
Call trace:
 dump_backtrace+0x114/0x120
 show_stack+0x20/0x68
 dump_stack_lvl+0x68/0x84
 dump_stack+0x18/0x34
 spl_dumpstack+0x2c/0x38 [spl]
 spl_panic+0xdc/0xf4 [spl]
 dsl_dir_create_sync+0x228/0x2a0 [zfs]
 dsl_dataset_create_sync+0x70/0x328 [zfs]
 dmu_recv_begin_sync+0x2f8/0x888 [zfs]
 dsl_sync_task_sync+0xc8/0x130 [zfs]
 dsl_pool_sync+0x3e8/0x4f8 [zfs]
 spa_sync+0x518/0xf00 [zfs]
 txg_sync_thread+0x2ac/0x450 [zfs]
 thread_generic_wrapper+0x68/0x80 [spl]
 kthread+0xe4/0xf0
 ret_from_fork+0x10/0x20

Following this, the failed txg_sync gets stuck, with the following error showing up repeatedly in the logs:

INFO: task txg_sync:869 blocked for more than 362 seconds.
      Tainted: P         C O      5.17.3 #1-NixOS
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:txg_sync        state:D stack:    0 pid:  869 ppid:     2 flags:0x00000208
Call trace:
 __switch_to+0x10c/0x1d0
 __schedule+0x824/0xf58
 schedule+0x50/0xe0
 spl_panic+0xf0/0xf4 [spl]
 dsl_dir_create_sync+0x228/0x2a0 [zfs]
 dsl_dataset_create_sync+0x70/0x328 [zfs]
 dmu_recv_begin_sync+0x2f8/0x888 [zfs]
 dsl_sync_task_sync+0xc8/0x130 [zfs]
 dsl_pool_sync+0x3e8/0x4f8 [zfs]
 spa_sync+0x518/0xf00 [zfs]
 txg_sync_thread+0x2ac/0x450 [zfs]
 thread_generic_wrapper+0x68/0x80 [spl]
 kthread+0xe4/0xf0
 ret_from_fork+0x10/0x20

jhollowe avatar Jul 07 '22 05:07 jhollowe

Also happened with zfs 2.1.5.

VERIFY3(0 == zap_add(mos, dsl_dir_phys(pds)->dd_child_dir_zapobj, name, sizeof (uint64_t), 1, &ddobj, tx)) failed (0 == 17)
PANIC at dsl_dir.c:951:dsl_dir_create_sync()
Showing stack for process 3064
CPU: 3 PID: 3064 Comm: txg_sync Tainted: P           OE     5.18.0-0.bpo.1-amd64 #1  Debian 5.18.2-1~bpo11+1
Hardware name: GIGABYTE MX33-BS1-V1/MX33-BS1-V1, BIOS F04a 10/01/2021
Call Trace:
 <TASK>
 dump_stack_lvl+0x45/0x5a
 spl_panic+0xd1/0xe9 [spl]
 dsl_dir_create_sync+0x21a/0x2a0 [zfs]
 dsl_dataset_create_sync+0x50/0x390 [zfs]
 dmu_recv_begin_sync+0x397/0xa20 [zfs]
 ? metaslab_class_get_dspace+0x12/0x30 [zfs]
 dsl_sync_task_sync+0xa3/0xf0 [zfs]
 dsl_pool_sync+0x40b/0x520 [zfs]
 spa_sync+0x540/0xfa0 [zfs]
 ? _raw_spin_lock+0x13/0x30
 ? spa_txg_history_init_io+0x101/0x110 [zfs]
 txg_sync_thread+0x2dc/0x490 [zfs]
 ? txg_fini+0x250/0x250 [zfs]
 ? __thread_exit+0x20/0x20 [spl]
 thread_generic_wrapper+0x56/0x60 [spl]
 kthread+0xe5/0x110
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x1f/0x30
 </TASK>

This seems to happen when receiving a non-encrypted snapshot into an encrypted dataset whose key has not been loaded.
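
A minimal sketch of the trigger, assuming hypothetical dataset names (backup/offsite is encrypted and its key is not loaded):

zfs get keystatus backup/offsite          # keystatus = unavailable
zfs snapshot tank/data@snap2
zfs send tank/data@snap2 | zfs receive backup/offsite/data
# The receive must create a new child dataset under the locked parent;
# per the traces above, the pool's txg_sync thread then panics in
# dsl_dir_create_sync().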

siilike avatar Jul 17 '22 11:07 siilike

I can confirm this issue.

OS: Linux (Armbian 22.05.3 Bullseye)
Arch: aarch64 (arm64)
Kernel Version: 5.10.123-meson64
zfs: zfs-2.1.5-1~bpo11+1, zfs-kmod-2.1.5-1~bpo11+1

Jul 24 23:20:46 odroidhc4 kernel: VERIFY3(0 == zap_add(mos, dsl_dir_phys(pds)->dd_child_dir_zapobj, name, sizeof (uint64_t)>
Jul 24 23:20:46 odroidhc4 kernel: PANIC at dsl_dir.c:951:dsl_dir_create_sync()
Jul 24 23:20:46 odroidhc4 kernel: Showing stack for process 2299
Jul 24 23:20:46 odroidhc4 kernel: CPU: 3 PID: 2299 Comm: txg_sync Tainted: P         C OE     5.10.123-meson64 #22.05.3
Jul 24 23:20:46 odroidhc4 kernel: Hardware name: Hardkernel ODROID-HC4 (DT)
Jul 24 23:20:46 odroidhc4 kernel: Call trace:
Jul 24 23:20:46 odroidhc4 kernel:  dump_backtrace+0x0/0x200
Jul 24 23:20:46 odroidhc4 kernel:  show_stack+0x18/0x68
Jul 24 23:20:46 odroidhc4 kernel:  dump_stack+0xd8/0x134
Jul 24 23:20:46 odroidhc4 kernel:  spl_dumpstack+0x24/0x30 [spl]
Jul 24 23:20:46 odroidhc4 kernel:  spl_panic+0xd0/0xe8 [spl]
Jul 24 23:20:46 odroidhc4 kernel:  dsl_dir_create_sync+0x220/0x2a0 [zfs]
Jul 24 23:20:46 odroidhc4 kernel:  dsl_dataset_create_sync+0x64/0x320 [zfs]
Jul 24 23:20:46 odroidhc4 kernel:  dmu_recv_begin_sync+0x2ec/0x874 [zfs]
Jul 24 23:20:46 odroidhc4 kernel:  dsl_sync_task_sync+0xc0/0x12c [zfs]
Jul 24 23:20:46 odroidhc4 kernel:  dsl_pool_sync+0x388/0x494 [zfs]
Jul 24 23:20:46 odroidhc4 kernel:  spa_sync+0x4e8/0xe24 [zfs]
Jul 24 23:20:46 odroidhc4 kernel:  txg_sync_thread+0x2ac/0x410 [zfs]
Jul 24 23:20:46 odroidhc4 kernel:  thread_generic_wrapper+0x60/0x80 [spl]
Jul 24 23:20:46 odroidhc4 kernel:  kthread+0x154/0x158
Jul 24 23:20:46 odroidhc4 kernel:  ret_from_fork+0x10/0x30

I can also confirm that it happens when receiving unencrypted to encrypted with unloaded keys.

Btw, thanks @siilike for the hint. I would never have realized the keys were unloaded (due to a bug), because I never mount that pool and all the usual zfs commands also work with unloaded keys. I was blaming myself for the kernel panic, because I had manually constructed a missing kernel header file needed to compile zfs as a kernel module.
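
For anyone else caught out by this, the key state can be checked without mounting anything, e.g. (hypothetical pool name):

zfs get -r keystatus backup
# NAME            PROPERTY   VALUE        SOURCE
# backup          keystatus  none         -
# backup/offsite  keystatus  unavailable  -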

Greek64 avatar Jul 27 '22 06:07 Greek64

17 is EEXIST. It appears that dmu_object_alloc is generating an objset ID that has already been recorded by a previous zap_add, presumably right before the interruption of the previous send attempt.
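
For reference, error 17 in the VERIFY3 output maps to EEXIST; on Linux this can be confirmed with:

grep -w EEXIST /usr/include/asm-generic/errno-base.h
# #define EEXIST          17      /* File exists */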

PaulZ-98 avatar Aug 02 '22 16:08 PaulZ-98

I also have run into this issue while transferring an encrypted dataset (with --raw on the send) to a machine without keys loaded.

Version Info:
root@cau:~# zfs --version
zfs-2.1.5-pve1
zfs-kmod-2.1.5-pve1
root@cau:~# uname -a
Linux cau 5.15.39-4-pve #1 SMP PVE 5.15.39-4 (Mon, 08 Aug 2022 15:11:15 +0200) x86_64 GNU/Linux

Cadair avatar Sep 05 '22 10:09 Cadair

Seeing this on Ubuntu as well, same scenario of sending to a dataset without its keys loaded. ZFS commands hang after the panic with the "task zpool:1234 blocked for more than 120 seconds" message, requiring a reboot.

~ $ zfs --version
zfs-0.8.3-1ubuntu12.13
zfs-kmod-2.1.4-0ubuntu0.1
~ $ uname -a
Linux hostname 5.15.0-46-generic #49~20.04.1-Ubuntu SMP Thu Aug 4 19:15:44 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
~ $ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.4 LTS"
~ $ 

n-cc avatar Sep 06 '22 16:09 n-cc

I can confirm that the PANIC happens when accidentally sending a non-raw zfs send onto an encrypted dataset with unloaded keys (i.e. an encrypted remote backup).

After that panic, all zpool and zfs commands hang and a complete system restart is necessary.

[39079.327825] VERIFY3(0 == zap_add(mos, dsl_dir_phys(pds)->dd_child_dir_zapobj, name, sizeof (uint64_t), 1, &ddobj, tx)) failed (0 == 17)
[39079.340749] PANIC at dsl_dir.c:951:dsl_dir_create_sync()
[39079.347128] Showing stack for process 766
[39079.347148] CPU: 1 PID: 766 Comm: txg_sync Tainted: P         C O      5.15.0-1014-raspi #16-Ubuntu
[39079.347156] Hardware name: Raspberry Pi 4 Model B Rev 1.1 (DT)
[39079.347160] Call trace:
[39079.347162]  dump_backtrace+0x0/0x1f0
[39079.347174]  show_stack+0x24/0x30
[39079.347178]  dump_stack_lvl+0x8c/0xb8
[39079.347188]  dump_stack+0x18/0x34
[39079.347194]  spl_dumpstack+0x30/0x3c [spl]
[39079.347242]  spl_panic+0xdc/0xf4 [spl]
[39079.347270]  dsl_dir_create_sync+0x21c/0x29c [zfs]
[39079.347765]  dsl_dataset_create_sync+0x70/0x324 [zfs]
[39079.348059]  dmu_recv_begin_sync+0x304/0x890 [zfs]
[39079.348347]  dsl_sync_task_sync+0xcc/0x140 [zfs]
[39079.348638]  dsl_pool_sync+0x348/0x3dc [zfs]
[39079.348925]  spa_sync_iterate_to_convergence+0xf4/0x20c [zfs]
[39079.349220]  spa_sync+0x2c0/0x534 [zfs]
[39079.349512]  txg_sync_thread+0x258/0x2c0 [zfs]
[39079.349761]  thread_generic_wrapper+0x6c/0x84 [spl]
[39079.349802]  kthread+0x12c/0x140
[39079.349810]  ret_from_fork+0x10/0x20
root@player /home/peter $ zfs --version
zfs-2.1.4-0ubuntu0.1
zfs-kmod-2.1.4-0ubuntu0.1
root@player /home/peter $ uname -a
Linux player.lan 5.15.0-1014-raspi #16-Ubuntu SMP PREEMPT Thu Aug 25 09:50:55 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux
root@player /home/peter $ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"

MaZderMind avatar Sep 16 '22 09:09 MaZderMind

Same for me: the PANIC happens when accidentally sending a non-raw zfs send onto an encrypted dataset with unloaded keys. (Shouldn't -w be implied in this situation, or at least an error be given?)
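
Until the fix lands, the safest guard seems to be making sure the target's key is loaded before any receive (hypothetical names below). Note that @Cadair above reports hitting the panic even with a --raw send, so sending raw may not be a reliable workaround:

zfs get keystatus backup/offsite     # must be "available", not "unavailable"
zfs load-key backup/offsite          # prompts for the passphrase
zfs send tank/data@snap2 | zfs receive backup/offsite/data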

zfs --version
zfs-2.1.5-pve1
zfs-kmod-2.1.5-pve1

[ 3734.014157] VERIFY3(0 == zap_add(mos, dsl_dir_phys(pds)->dd_child_dir_zapobj, name, sizeof (uint64_t), 1, &ddobj, tx)) failed (0 == 17)
[ 3734.014162] PANIC at dsl_dir.c:951:dsl_dir_create_sync()
[ 3734.014165] Showing stack for process 894
[ 3734.014166] CPU: 1 PID: 894 Comm: txg_sync Tainted: P O 5.15.53-1-pve #1
[ 3734.014182] Hardware name: FUJITSU /D3644-B1, BIOS V5.0.0.13 R1.26.0 for D3644-B1x 03/08/2021
[ 3734.014185] Call Trace:
[ 3734.014186]  <TASK>
[ 3734.014189]  dump_stack_lvl+0x4a/0x63
[ 3734.014194]  dump_stack+0x10/0x16
[ 3734.014197]  spl_dumpstack+0x29/0x2f [spl]
[ 3734.014204]  spl_panic+0xd1/0xe9 [spl]
[ 3734.014211]  ? zap_add+0xbe/0xf0 [zfs]
[ 3734.014287]  dsl_dir_create_sync+0x212/0x290 [zfs]
[ 3734.014335]  dsl_dataset_create_sync+0x52/0x380 [zfs]
[ 3734.014381]  dmu_recv_begin_sync+0x3b2/0xa70 [zfs]
[ 3734.014421]  ? spa_get_slop_space+0x5d/0xb0 [zfs]
[ 3734.014478]  dsl_sync_task_sync+0xb7/0x110 [zfs]
[ 3734.014528]  dsl_pool_sync+0x3dd/0x4f0 [zfs]
[ 3734.014576]  spa_sync+0x55a/0x1020 [zfs]
[ 3734.014631]  ? spa_txg_history_init_io+0x10a/0x120 [zfs]
[ 3734.014688]  txg_sync_thread+0x2d3/0x460 [zfs]
[ 3734.014746]  ? txg_init+0x2c0/0x2c0 [zfs]
[ 3734.014803]  thread_generic_wrapper+0x61/0x80 [spl]
[ 3734.014810]  ? __thread_exit+0x20/0x20 [spl]
[ 3734.014815]  kthread+0x127/0x150
[ 3734.014819]  ? set_kthread_struct+0x50/0x50
[ 3734.014822]  ret_from_fork+0x1f/0x30
[ 3734.014827]  </TASK>

Thanks lopiuh

lopiuh avatar Sep 23 '22 06:09 lopiuh

Sorry for not noticing this issue until now. This bug is raised in #14055 as well. Please see there for a fix.

AttilaFueloep avatar Oct 25 '22 18:10 AttilaFueloep

No worries @AttilaFueloep – thank you much for working on this! Bugs happen, no need to be sorry :)

MaZderMind avatar Oct 25 '22 18:10 MaZderMind

@MaZderMind Thanks for your kind words, much appreciated!

AttilaFueloep avatar Oct 31 '22 23:10 AttilaFueloep