PANIC at dmu_recv.c on receiving snapshot to encrypted file system
System information
| Type | Version/Name |
|---|---|
| Distribution Name | CentOS |
| Distribution Version | 7.9.2009 |
| Kernel Version | 3.10.0-1160.42.2.el7.x86_64 |
| Architecture | x86_64 |
| OpenZFS Version | 2.0.6-1 |
Describe the problem you're observing
This issue seems very similar to #12270 and #12001 but has a different stack trace.
The system is a backup target that uses Sanoid/Syncoid to manage and transfer snapshots from several systems, and it has been in operation for several years. While running ZFS 0.8.6 it suffered regular OOM-killer events on the zpool command but was otherwise stable, so it was upgraded from 0.8.6 to 2.0.6 in mid-October in the hope that this would improve reliability. It hasn't!
The system has roughly 150 ZFS file systems that are either ZFS receive targets, SSH/rsync targets, or back a minio server (an experiment with object stores, currently unused). All local file systems are encrypted with aes-256-ccm. Remote sources are a mixture of unencrypted, aes-256-ccm, and aes-256-gcm (depending on host ZFS capabilities and age). In addition, an offsite ZFS target pulls these snapshots to provide disaster recovery. That remote system is stable and runs on largely identical hardware, but it only pulls from this host (no transfers out).
zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
backup_pool 654T 388T 266T - - 6% 59% 1.00x ONLINE -
Host has 64GB RAM.
Snapshots are created and destroyed on both the source and the destination throughout long transfers; several of these transfers are multi-TB in size and are failing to complete between hangs.
On hang:
- System load continues to rise
- zpool status reports permanent data errors although no disks show errors
- zpool status -v hangs the machine
Syncoid is called with: --no-sync-snap --create-bookmark -r --skip-parent (I have experimented with changing the mbuffer size, with little effect).
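For reference, a typical invocation looks roughly like the following (host and dataset names here are placeholders, not the real ones; --mbuffer-size can be added to the same command line):

# syncoid --no-sync-snap --create-bookmark -r --skip-parent \
    root@source-host:tank/data backup_pool/hosts/source-host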
At the time of the hang the system log contains:
Nov 6 20:05:07 fs3 kernel: VERIFY3(0 == dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)
Nov 6 20:05:07 fs3 kernel: PANIC at dmu_recv.c:1811:receive_object()
Nov 6 20:05:07 fs3 kernel: Showing stack for process 12169
Nov 6 20:05:07 fs3 kernel: CPU: 15 PID: 12169 Comm: receive_writer Kdump: loaded Tainted: P OE ------------ 3.10.0-1160.42.2.el7.x86_64 #1
Nov 6 20:05:07 fs3 kernel: Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 2.1 06/14/2018
Nov 6 20:05:07 fs3 kernel: Call Trace:
Nov 6 20:05:07 fs3 kernel: [<ffffffff85583539>] dump_stack+0x19/0x1b
Nov 6 20:05:07 fs3 kernel: [<ffffffffc025ac5b>] spl_dumpstack+0x2b/0x30 [spl]
Nov 6 20:05:07 fs3 kernel: [<ffffffffc025ad29>] spl_panic+0xc9/0x110 [spl]
Nov 6 20:05:07 fs3 kernel: [<ffffffffc0aa476e>] ? arc_space_consume+0x5e/0x110 [zfs]
Nov 6 20:05:07 fs3 kernel: [<ffffffff85588632>] ? down_read+0x12/0x40
Nov 6 20:05:07 fs3 kernel: [<ffffffffc0abd30f>] ? dbuf_read+0xff/0x570 [zfs]
Nov 6 20:05:07 fs3 kernel: [<ffffffffc0ac45ee>] ? dmu_bonus_hold_by_dnode+0xde/0x1b0 [zfs]
Nov 6 20:05:07 fs3 kernel: [<ffffffffc025c6b5>] ? spl_kmem_cache_free+0x185/0x210 [spl]
Nov 6 20:05:07 fs3 kernel: [<ffffffffc0ad12da>] receive_object+0x64a/0xc80 [zfs]
Nov 6 20:05:07 fs3 kernel: [<ffffffffc0ae403b>] ? dnode_rele_and_unlock+0x6b/0xf0 [zfs]
Nov 6 20:05:07 fs3 kernel: [<ffffffffc0ac4d60>] ? dmu_free_long_range+0x370/0x510 [zfs]
Nov 6 20:05:07 fs3 kernel: [<ffffffffc025b7b5>] ? spl_kmem_free+0x35/0x40 [spl]
Nov 6 20:05:07 fs3 kernel: [<ffffffff851a67fd>] ? list_del+0xd/0x30
Nov 6 20:05:07 fs3 kernel: [<ffffffffc0ad4904>] receive_writer_thread+0x4e4/0xa30 [zfs]
Nov 6 20:05:07 fs3 kernel: [<ffffffff84ede305>] ? sched_clock_cpu+0x85/0xc0
Nov 6 20:05:07 fs3 kernel: [<ffffffffc0ad4420>] ? receive_process_write_record+0x180/0x180 [zfs]
Nov 6 20:05:07 fs3 kernel: [<ffffffffc0ad4420>] ? receive_process_write_record+0x180/0x180 [zfs]
Nov 6 20:05:07 fs3 kernel: [<ffffffffc0261873>] thread_generic_wrapper+0x73/0x80 [spl]
Nov 6 20:05:07 fs3 kernel: [<ffffffffc0261800>] ? __thread_exit+0x20/0x20 [spl]
Nov 6 20:05:07 fs3 kernel: [<ffffffff84ec5e61>] kthread+0xd1/0xe0
Nov 6 20:05:07 fs3 kernel: [<ffffffff84ec5d90>] ? insert_kthread_work+0x40/0x40
Nov 6 20:05:07 fs3 kernel: [<ffffffff85595ddd>] ret_from_fork_nospec_begin+0x7/0x21
Nov 6 20:05:07 fs3 kernel: [<ffffffff84ec5d90>] ? insert_kthread_work+0x40/0x40
Nov 6 20:05:07 fs3 zed: eid=44379 class=authentication pool='backup_pool' bookmark=185192:0:0:0
Nov 6 20:06:01 fs3 systemd: Started Session 3231 of user root.
Nov 6 20:07:01 fs3 systemd: Started Session 3232 of user root.
Nov 6 20:08:01 fs3 systemd: Started Session 3233 of user root.
Nov 6 20:08:05 fs3 kernel: INFO: task txg_quiesce:7399 blocked for more than 120 seconds.
Nov 6 20:08:05 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 6 20:08:05 fs3 kernel: txg_quiesce D ffff96bc71d09660 0 7399 2 0x00000000
Nov 6 20:08:05 fs3 kernel: Call Trace:
Nov 6 20:08:05 fs3 kernel: [<ffffffff85589179>] schedule+0x29/0x70
Nov 6 20:08:05 fs3 kernel: [<ffffffffc025a325>] cv_wait_common+0x125/0x150 [spl]
Nov 6 20:08:05 fs3 kernel: [<ffffffff84ec6f50>] ? wake_up_atomic_t+0x30/0x30
Nov 6 20:08:05 fs3 kernel: [<ffffffffc025a365>] __cv_wait+0x15/0x20 [spl]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0b5493b>] txg_quiesce_thread+0x2db/0x3e0 [zfs]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0b54660>] ? txg_init+0x2b0/0x2b0 [zfs]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0261873>] thread_generic_wrapper+0x73/0x80 [spl]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0261800>] ? __thread_exit+0x20/0x20 [spl]
Nov 6 20:08:05 fs3 kernel: [<ffffffff84ec5e61>] kthread+0xd1/0xe0
Nov 6 20:08:05 fs3 kernel: [<ffffffff84ec5d90>] ? insert_kthread_work+0x40/0x40
Nov 6 20:08:05 fs3 kernel: [<ffffffff85595ddd>] ret_from_fork_nospec_begin+0x7/0x21
Nov 6 20:08:05 fs3 kernel: [<ffffffff84ec5d90>] ? insert_kthread_work+0x40/0x40
Nov 6 20:08:05 fs3 kernel: INFO: task receive_writer:12169 blocked for more than 120 seconds.
Nov 6 20:08:05 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 6 20:08:05 fs3 kernel: receive_writer D ffff96add74f5280 0 12169 2 0x00000080
Nov 6 20:08:05 fs3 kernel: Call Trace:
Nov 6 20:08:05 fs3 kernel: [<ffffffff85589179>] schedule+0x29/0x70
Nov 6 20:08:05 fs3 kernel: [<ffffffffc025ad55>] spl_panic+0xf5/0x110 [spl]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0aa476e>] ? arc_space_consume+0x5e/0x110 [zfs]
Nov 6 20:08:05 fs3 kernel: [<ffffffff85588632>] ? down_read+0x12/0x40
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0abd30f>] ? dbuf_read+0xff/0x570 [zfs]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0ac45ee>] ? dmu_bonus_hold_by_dnode+0xde/0x1b0 [zfs]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc025c6b5>] ? spl_kmem_cache_free+0x185/0x210 [spl]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0ad12da>] receive_object+0x64a/0xc80 [zfs]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0ae403b>] ? dnode_rele_and_unlock+0x6b/0xf0 [zfs]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0ac4d60>] ? dmu_free_long_range+0x370/0x510 [zfs]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc025b7b5>] ? spl_kmem_free+0x35/0x40 [spl]
Nov 6 20:08:05 fs3 kernel: [<ffffffff851a67fd>] ? list_del+0xd/0x30
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0ad4904>] receive_writer_thread+0x4e4/0xa30 [zfs]
Nov 6 20:08:05 fs3 kernel: [<ffffffff84ede305>] ? sched_clock_cpu+0x85/0xc0
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0ad4420>] ? receive_process_write_record+0x180/0x180 [zfs]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0ad4420>] ? receive_process_write_record+0x180/0x180 [zfs]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0261873>] thread_generic_wrapper+0x73/0x80 [spl]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0261800>] ? __thread_exit+0x20/0x20 [spl]
Nov 6 20:08:05 fs3 kernel: [<ffffffff84ec5e61>] kthread+0xd1/0xe0
Nov 6 20:08:05 fs3 kernel: [<ffffffff84ec5d90>] ? insert_kthread_work+0x40/0x40
Nov 6 20:08:05 fs3 kernel: [<ffffffff85595ddd>] ret_from_fork_nospec_begin+0x7/0x21
Nov 6 20:08:05 fs3 kernel: [<ffffffff84ec5d90>] ? insert_kthread_work+0x40/0x40
Nov 6 20:08:05 fs3 kernel: INFO: task zfs:12640 blocked for more than 120 seconds.
Nov 6 20:08:05 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 6 20:08:05 fs3 kernel: zfs D ffff96bc6b63a6e0 0 12640 12634 0x00000080
Nov 6 20:08:05 fs3 kernel: Call Trace:
Nov 6 20:08:05 fs3 kernel: [<ffffffff85589179>] schedule+0x29/0x70
Nov 6 20:08:05 fs3 kernel: [<ffffffff85586e41>] schedule_timeout+0x221/0x2d0
Nov 6 20:08:05 fs3 kernel: [<ffffffff84edae02>] ? default_wake_function+0x12/0x20
Nov 6 20:08:05 fs3 kernel: [<ffffffff84ec6f7b>] ? autoremove_wake_function+0x2b/0x40
Nov 6 20:08:05 fs3 kernel: [<ffffffff84f06992>] ? ktime_get_ts64+0x52/0xf0
Nov 6 20:08:05 fs3 kernel: [<ffffffff85588a2d>] io_schedule_timeout+0xad/0x130
Nov 6 20:08:05 fs3 kernel: [<ffffffff84ec6ad6>] ? prepare_to_wait_exclusive+0x56/0x90
Nov 6 20:08:05 fs3 kernel: [<ffffffff85588ac8>] io_schedule+0x18/0x20
Nov 6 20:08:05 fs3 kernel: [<ffffffffc025a2b2>] cv_wait_common+0xb2/0x150 [spl]
Nov 6 20:08:05 fs3 kernel: [<ffffffff84ec6f50>] ? wake_up_atomic_t+0x30/0x30
Nov 6 20:08:05 fs3 kernel: [<ffffffffc025a388>] __cv_wait_io+0x18/0x20 [spl]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0b54155>] txg_wait_synced_impl+0xe5/0x130 [zfs]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0b17d40>] ? dsl_dataset_user_hold_check_one+0x170/0x170 [zfs]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0b541b0>] txg_wait_synced+0x10/0x40 [zfs]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0b1683c>] dsl_sync_task_common+0xdc/0x2b0 [zfs]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0b17660>] ? dsl_onexit_hold_cleanup+0xd0/0xd0 [zfs]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0b17d40>] ? dsl_dataset_user_hold_check_one+0x170/0x170 [zfs]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0b17660>] ? dsl_onexit_hold_cleanup+0xd0/0xd0 [zfs]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0b16a36>] dsl_sync_task+0x26/0x30 [zfs]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0b18124>] dsl_dataset_user_hold+0x94/0xf0 [zfs]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0b95ba1>] zfs_ioc_hold+0xc1/0x1b0 [zfs]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0b9d42b>] zfsdev_ioctl_common+0x51b/0x820 [zfs]
Nov 6 20:08:05 fs3 kernel: [<ffffffffc0bc8ff6>] zfsdev_ioctl+0x56/0xf0 [zfs]
Nov 6 20:08:05 fs3 kernel: [<ffffffff85063ab0>] do_vfs_ioctl+0x3a0/0x5b0
Nov 6 20:08:05 fs3 kernel: [<ffffffff85590678>] ? __do_page_fault+0x238/0x500
Nov 6 20:08:05 fs3 kernel: [<ffffffff85063d61>] SyS_ioctl+0xa1/0xc0
Nov 6 20:08:05 fs3 kernel: [<ffffffff85595f92>] system_call_fastpath+0x25/0x2a
Nov 6 20:09:01 fs3 systemd: Started Session 3234 of user root.
Nov 6 20:10:01 fs3 systemd: Started Session 3235 of user root.
Nov 6 20:10:01 fs3 systemd: Started Session 3236 of user root.
Nov 6 20:10:01 fs3 systemd: Started Session 3237 of user root.
Nov 6 20:10:02 fs3 systemd-logind: New session 3238 of user hbabk0.
Nov 6 20:10:02 fs3 systemd: Started Session 3238 of user hbabk0.
Nov 6 20:10:05 fs3 kernel: INFO: task txg_quiesce:7399 blocked for more than 120 seconds.
Nov 6 20:10:05 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 6 20:10:05 fs3 kernel: txg_quiesce D ffff96bc71d09660 0 7399 2 0x00000000
Nov 6 20:10:05 fs3 kernel: Call Trace:
Nov 6 20:10:05 fs3 kernel: [<ffffffff85589179>] schedule+0x29/0x70
Nov 6 20:10:05 fs3 kernel: [<ffffffffc025a325>] cv_wait_common+0x125/0x150 [spl]
Nov 6 20:10:05 fs3 kernel: [<ffffffff84ec6f50>] ? wake_up_atomic_t+0x30/0x30
Nov 6 20:10:05 fs3 kernel: [<ffffffffc025a365>] __cv_wait+0x15/0x20 [spl]
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0b5493b>] txg_quiesce_thread+0x2db/0x3e0 [zfs]
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0b54660>] ? txg_init+0x2b0/0x2b0 [zfs]
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0261873>] thread_generic_wrapper+0x73/0x80 [spl]
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0261800>] ? __thread_exit+0x20/0x20 [spl]
Nov 6 20:10:05 fs3 kernel: [<ffffffff84ec5e61>] kthread+0xd1/0xe0
Nov 6 20:10:05 fs3 kernel: [<ffffffff84ec5d90>] ? insert_kthread_work+0x40/0x40
Nov 6 20:10:05 fs3 kernel: [<ffffffff85595ddd>] ret_from_fork_nospec_begin+0x7/0x21
Nov 6 20:10:05 fs3 kernel: [<ffffffff84ec5d90>] ? insert_kthread_work+0x40/0x40
Nov 6 20:10:05 fs3 kernel: INFO: task minio:9923 blocked for more than 120 seconds.
Nov 6 20:10:05 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 6 20:10:05 fs3 kernel: minio D ffff96bc96255860 0 9923 1 0x00000080
Nov 6 20:10:05 fs3 kernel: Call Trace:
Nov 6 20:10:05 fs3 kernel: [<ffffffff8558a099>] schedule_preempt_disabled+0x29/0x70
Nov 6 20:10:05 fs3 kernel: [<ffffffff855881e7>] __mutex_lock_slowpath+0xc7/0x1d0
Nov 6 20:10:05 fs3 kernel: [<ffffffff855875bf>] mutex_lock+0x1f/0x2f
Nov 6 20:10:05 fs3 kernel: [<ffffffff855800d6>] lookup_slow+0x33/0xa7
Nov 6 20:10:05 fs3 kernel: [<ffffffff8505cd9f>] link_path_walk+0x80f/0x8b0
Nov 6 20:10:05 fs3 kernel: [<ffffffff8505cfaa>] path_lookupat+0x7a/0x8d0
Nov 6 20:10:05 fs3 kernel: [<ffffffff84ede305>] ? sched_clock_cpu+0x85/0xc0
Nov 6 20:10:05 fs3 kernel: [<ffffffff85059137>] ? path_get+0x27/0x30
Nov 6 20:10:05 fs3 kernel: [<ffffffff8505d82b>] filename_lookup+0x2b/0xc0
Nov 6 20:10:05 fs3 kernel: [<ffffffff85061537>] user_path_at_empty+0x67/0xc0
Nov 6 20:10:05 fs3 kernel: [<ffffffff850615a1>] user_path_at+0x11/0x20
Nov 6 20:10:05 fs3 kernel: [<ffffffff85053fe3>] vfs_fstatat+0x63/0xc0
Nov 6 20:10:05 fs3 kernel: [<ffffffff85054454>] SYSC_newfstatat+0x24/0x60
Nov 6 20:10:05 fs3 kernel: [<ffffffff8505487e>] SyS_newfstatat+0xe/0x10
Nov 6 20:10:05 fs3 kernel: [<ffffffff85595f92>] system_call_fastpath+0x25/0x2a
Nov 6 20:10:05 fs3 kernel: INFO: task minio:9924 blocked for more than 120 seconds.
Nov 6 20:10:05 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 6 20:10:05 fs3 kernel: minio D ffff96bc6ab4c7e0 0 9924 1 0x00000080
Nov 6 20:10:05 fs3 kernel: Call Trace:
Nov 6 20:10:05 fs3 kernel: [<ffffffff85589179>] schedule+0x29/0x70
Nov 6 20:10:05 fs3 kernel: [<ffffffffc025a325>] cv_wait_common+0x125/0x150 [spl]
Nov 6 20:10:05 fs3 kernel: [<ffffffff84ec6f50>] ? wake_up_atomic_t+0x30/0x30
Nov 6 20:10:05 fs3 kernel: [<ffffffffc025a365>] __cv_wait+0x15/0x20 [spl]
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0adf90b>] dmu_tx_wait+0xbb/0x3b0 [zfs]
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0adfc91>] dmu_tx_assign+0x91/0x490 [zfs]
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0ac4c8b>] dmu_free_long_range+0x29b/0x510 [zfs]
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0bd77e9>] zfs_trunc+0x79/0x210 [zfs]
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0bdc3a2>] zfs_freesp+0xa2/0x510 [zfs]
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0bc1693>] ? zfs_zaccess_aces_check+0x1b3/0x3e0 [zfs]
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0bc1986>] ? zfs_zaccess_common+0xc6/0x230 [zfs]
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0bc2f36>] ? zfs_zaccess+0x106/0x3d0 [zfs]
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0bd1daf>] zfs_setattr+0x92f/0x2360 [zfs]
Nov 6 20:10:05 fs3 kernel: [<ffffffff855875b2>] ? mutex_lock+0x12/0x2f
Nov 6 20:10:05 fs3 kernel: [<ffffffff8506cbec>] ? inode_add_lru+0x3c/0xa0
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0be3c1d>] ? zpl_xattr_get_dir+0x9d/0x1c0 [zfs]
Nov 6 20:10:05 fs3 kernel: [<ffffffff855875b2>] ? mutex_lock+0x12/0x2f
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0b2d63b>] ? rrm_exit+0x4b/0x90 [zfs]
Nov 6 20:10:05 fs3 kernel: [<ffffffff85028f01>] ? __kmalloc_node+0x1d1/0x2b0
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0be432c>] ? zpl_xattr_get+0x12c/0x1d0 [zfs]
Nov 6 20:10:05 fs3 kernel: [<ffffffffc025ba84>] ? spl_kmem_zalloc+0xc4/0x140 [spl]
Nov 6 20:10:05 fs3 kernel: [<ffffffffc025ba84>] ? spl_kmem_zalloc+0xc4/0x140 [spl]
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0be249b>] zpl_setattr+0x10b/0x1a0 [zfs]
Nov 6 20:10:05 fs3 kernel: [<ffffffff8506e17c>] notify_change+0x30c/0x4d0
Nov 6 20:10:05 fs3 kernel: [<ffffffff8504c2a5>] do_truncate+0x75/0xc0
Nov 6 20:10:05 fs3 kernel: [<ffffffff8505e637>] do_last+0x627/0x1340
Nov 6 20:10:05 fs3 kernel: [<ffffffff84ede305>] ? sched_clock_cpu+0x85/0xc0
Nov 6 20:10:05 fs3 kernel: [<ffffffff850293a6>] ? kmem_cache_alloc_trace+0x1d6/0x200
Nov 6 20:10:05 fs3 kernel: [<ffffffff8505f41d>] path_openat+0xcd/0x5a0
Nov 6 20:10:05 fs3 kernel: [<ffffffff8506166d>] do_filp_open+0x4d/0xb0
Nov 6 20:10:05 fs3 kernel: [<ffffffff8506f767>] ? __alloc_fd+0x47/0x170
Nov 6 20:10:05 fs3 kernel: [<ffffffff8504d404>] do_sys_open+0x124/0x220
Nov 6 20:10:05 fs3 kernel: [<ffffffff8504d534>] SyS_openat+0x14/0x20
Nov 6 20:10:05 fs3 kernel: [<ffffffff85595f92>] system_call_fastpath+0x25/0x2a
Nov 6 20:10:05 fs3 kernel: INFO: task minio:9926 blocked for more than 120 seconds.
Nov 6 20:10:05 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 6 20:10:05 fs3 kernel: minio D ffff96bc962547e0 0 9926 1 0x00000080
Nov 6 20:10:05 fs3 kernel: Call Trace:
Nov 6 20:10:05 fs3 kernel: [<ffffffff85589179>] schedule+0x29/0x70
Nov 6 20:10:05 fs3 kernel: [<ffffffffc025a325>] cv_wait_common+0x125/0x150 [spl]
Nov 6 20:10:05 fs3 kernel: [<ffffffff84ec6f50>] ? wake_up_atomic_t+0x30/0x30
Nov 6 20:10:05 fs3 kernel: [<ffffffffc025a365>] __cv_wait+0x15/0x20 [spl]
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0adf90b>] dmu_tx_wait+0xbb/0x3b0 [zfs]
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0bcecf9>] zfs_mkdir+0x229/0x720 [zfs]
Nov 6 20:10:05 fs3 kernel: [<ffffffff85124c0d>] ? security_transition_sid+0x2d/0x40
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0be2ff0>] zpl_mkdir+0xc0/0x1a0 [zfs]
Nov 6 20:10:05 fs3 kernel: [<ffffffff8505b80f>] vfs_mkdir+0x10f/0x1d0
Nov 6 20:10:05 fs3 kernel: [<ffffffff85061b0a>] SyS_mkdirat+0xca/0x100
Nov 6 20:10:05 fs3 kernel: [<ffffffff85595f92>] system_call_fastpath+0x25/0x2a
Nov 6 20:10:05 fs3 kernel: INFO: task minio:9927 blocked for more than 120 seconds.
Nov 6 20:10:05 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 6 20:10:05 fs3 kernel: minio D ffff96bc6ab4d860 0 9927 1 0x00000080
Nov 6 20:10:05 fs3 kernel: Call Trace:
Nov 6 20:10:05 fs3 kernel: [<ffffffff8558a099>] schedule_preempt_disabled+0x29/0x70
Nov 6 20:10:05 fs3 kernel: [<ffffffff855881e7>] __mutex_lock_slowpath+0xc7/0x1d0
Nov 6 20:10:05 fs3 kernel: [<ffffffff855875bf>] mutex_lock+0x1f/0x2f
Nov 6 20:10:05 fs3 kernel: [<ffffffff855800d6>] lookup_slow+0x33/0xa7
Nov 6 20:10:05 fs3 kernel: [<ffffffff8505cd9f>] link_path_walk+0x80f/0x8b0
Nov 6 20:10:05 fs3 kernel: [<ffffffff8505cfaa>] path_lookupat+0x7a/0x8d0
Nov 6 20:10:05 fs3 kernel: [<ffffffff84ede305>] ? sched_clock_cpu+0x85/0xc0
Nov 6 20:10:05 fs3 kernel: [<ffffffff85059137>] ? path_get+0x27/0x30
Nov 6 20:10:05 fs3 kernel: [<ffffffff8505d82b>] filename_lookup+0x2b/0xc0
Nov 6 20:10:05 fs3 kernel: [<ffffffff85061537>] user_path_at_empty+0x67/0xc0
Nov 6 20:10:05 fs3 kernel: [<ffffffff850615a1>] user_path_at+0x11/0x20
Nov 6 20:10:05 fs3 kernel: [<ffffffff85053fe3>] vfs_fstatat+0x63/0xc0
Nov 6 20:10:05 fs3 kernel: [<ffffffff85054454>] SYSC_newfstatat+0x24/0x60
Nov 6 20:10:05 fs3 kernel: [<ffffffff8505487e>] SyS_newfstatat+0xe/0x10
Nov 6 20:10:05 fs3 kernel: [<ffffffff85595f92>] system_call_fastpath+0x25/0x2a
Nov 6 20:10:05 fs3 kernel: INFO: task minio:10031 blocked for more than 120 seconds.
Nov 6 20:10:05 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 6 20:10:05 fs3 kernel: minio D ffff96bc66ca5860 0 10031 1 0x00000080
Nov 6 20:10:05 fs3 kernel: Call Trace:
Nov 6 20:10:05 fs3 kernel: [<ffffffff85589179>] schedule+0x29/0x70
Nov 6 20:10:05 fs3 kernel: [<ffffffffc025a325>] cv_wait_common+0x125/0x150 [spl]
Nov 6 20:10:05 fs3 kernel: [<ffffffff84ec6f50>] ? wake_up_atomic_t+0x30/0x30
Nov 6 20:10:05 fs3 kernel: [<ffffffffc025a365>] __cv_wait+0x15/0x20 [spl]
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0adf90b>] dmu_tx_wait+0xbb/0x3b0 [zfs]
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0bcecf9>] zfs_mkdir+0x229/0x720 [zfs]
Nov 6 20:10:05 fs3 kernel: [<ffffffff85124c0d>] ? security_transition_sid+0x2d/0x40
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0be2ff0>] zpl_mkdir+0xc0/0x1a0 [zfs]
Nov 6 20:10:05 fs3 kernel: [<ffffffff8505b80f>] vfs_mkdir+0x10f/0x1d0
Nov 6 20:10:05 fs3 kernel: [<ffffffff85061b0a>] SyS_mkdirat+0xca/0x100
Nov 6 20:10:05 fs3 kernel: [<ffffffff85595f92>] system_call_fastpath+0x25/0x2a
Nov 6 20:10:05 fs3 kernel: INFO: task minio:10032 blocked for more than 120 seconds.
Nov 6 20:10:05 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 6 20:10:05 fs3 kernel: minio D ffff96bc60da1660 0 10032 1 0x00000080
Nov 6 20:10:05 fs3 kernel: Call Trace:
Nov 6 20:10:05 fs3 kernel: [<ffffffff85589179>] schedule+0x29/0x70
Nov 6 20:10:05 fs3 kernel: [<ffffffffc025a325>] cv_wait_common+0x125/0x150 [spl]
Nov 6 20:10:05 fs3 kernel: [<ffffffff84ec6f50>] ? wake_up_atomic_t+0x30/0x30
Nov 6 20:10:05 fs3 kernel: [<ffffffffc025a365>] __cv_wait+0x15/0x20 [spl]
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0adf90b>] dmu_tx_wait+0xbb/0x3b0 [zfs]
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0bd02a2>] zfs_rmdir+0x102/0x510 [zfs]
Nov 6 20:10:05 fs3 kernel: [<ffffffffc0be259d>] zpl_rmdir+0x6d/0xc0 [zfs]
Nov 6 20:10:05 fs3 kernel: [<ffffffff8505bcbc>] vfs_rmdir+0xdc/0x150
Nov 6 20:10:05 fs3 kernel: [<ffffffff85060941>] do_rmdir+0x1f1/0x220
Nov 6 20:10:05 fs3 kernel: [<ffffffff85061ba5>] SyS_unlinkat+0x25/0x40
Nov 6 20:10:05 fs3 kernel: [<ffffffff85595f92>] system_call_fastpath+0x25/0x2a
I believe the minio hangs are just a symptom of the underlying issue, not the cause.
Post-reboot, the permanent errors are typically in snapshots (or files within snapshots), mostly from snapshots that have since been deleted:
[root@fs3 cron.d]# zpool status -v
pool: backup_pool
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: resilvered 0B in 00:00:20 with 0 errors on Thu Nov 4 12:17:50 2021
config:
NAME STATE READ WRITE CKSUM
backup_pool ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
35000cca2537ad22c ONLINE 0 0 0
35000cca25380af3c ONLINE 0 0 0
35000cca2701aa128 ONLINE 0 0 0
35000cca2701b8930 ONLINE 0 0 0
35000cca2701bd878 ONLINE 0 0 0
35000cca2701f5750 ONLINE 0 0 0
35000cca2702028f4 ONLINE 0 0 0
35000cca270209964 ONLINE 0 0 0
35000cca27020d35c ONLINE 0 0 0
35000cca27020d3c0 ONLINE 0 0 0
35000cca27020d558 ONLINE 0 0 0
35000cca270226748 ONLINE 0 0 0
raidz2-1 ONLINE 0 0 0
35000cca27022d244 ONLINE 0 0 0
35000cca270238244 ONLINE 0 0 0
35000cca2702477ac ONLINE 0 0 0
35000cca27024830c ONLINE 0 0 0
35000cca27024f468 ONLINE 0 0 0
35000cca2702597b4 ONLINE 0 0 0
35000cca27025ade8 ONLINE 0 0 0
35000cca27025b110 ONLINE 0 0 0
35000cca27025fc94 ONLINE 0 0 0
35000cca2702693c0 ONLINE 0 0 0
35000cca270269408 ONLINE 0 0 0
35000cca27026b290 ONLINE 0 0 0
raidz2-2 ONLINE 0 0 0
35000cca27026b320 ONLINE 0 0 0
35000cca27026b428 ONLINE 0 0 0
35000cca27026b56c ONLINE 0 0 0
35000cca27026b5dc ONLINE 0 0 0
35000cca27026b978 ONLINE 0 0 0
35000cca27026b9a8 ONLINE 0 0 0
35000cca27026ba40 ONLINE 0 0 0
35000cca270272e50 ONLINE 0 0 0
35000cca270273c2c ONLINE 0 0 0
35000cca270273e98 ONLINE 0 0 0
35000cca270274174 ONLINE 0 0 0
35000cca27027452c ONLINE 0 0 0
raidz2-3 ONLINE 0 0 0
35000cca2702745c4 ONLINE 0 0 0
35000cca270274b50 ONLINE 0 0 0
35000cca270275010 ONLINE 0 0 0
35000cca270275d14 ONLINE 0 0 0
35000cca2702765b8 ONLINE 0 0 0
35000cca2702767e0 ONLINE 0 0 0
35000cca270276d84 ONLINE 0 0 0
35000cca270278108 ONLINE 0 0 0
35000cca270278dfc ONLINE 0 0 0
35000cca270278eb0 ONLINE 0 0 0
35000cca27027b57c ONLINE 0 0 0
35000cca27027b584 ONLINE 0 0 0
raidz2-4 ONLINE 0 0 0
35000cca27027c418 ONLINE 0 0 0
35000cca270280e58 ONLINE 0 0 0
35000cca270281e58 ONLINE 0 0 0
35000cca270282c64 ONLINE 0 0 0
35000cca27028d27c ONLINE 0 0 0
35000cca27028d748 ONLINE 0 0 0
35000cca27028d758 ONLINE 0 0 0
35000cca27028d8fc ONLINE 0 0 0
35000cca27028e5d4 ONLINE 0 0 0
35000cca27028e758 ONLINE 0 0 0
35000cca27028e9dc ONLINE 0 0 0
35000cca27028eba0 ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
<0x2d368>:<0x0>
This happened again today (after a reboot this morning to recover from the above).
Nov 8 14:31:52 fs3 kernel: VERIFY3(0 == dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)
Nov 8 14:31:52 fs3 kernel: PANIC at dmu_recv.c:1811:receive_object()
Nov 8 14:31:52 fs3 kernel: Showing stack for process 24727
Nov 8 14:31:52 fs3 kernel: CPU: 7 PID: 24727 Comm: receive_writer Kdump: loaded Tainted: P OE ------------ 3.10.0-1160.42.2.el7.x86_64 #1
Nov 8 14:31:52 fs3 kernel: Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 2.1 06/14/2018
Nov 8 14:31:52 fs3 kernel: Call Trace:
Nov 8 14:31:52 fs3 kernel: [<ffffffff83983539>] dump_stack+0x19/0x1b
Nov 8 14:31:52 fs3 kernel: [<ffffffffc0521c5b>] spl_dumpstack+0x2b/0x30 [spl]
Nov 8 14:31:52 fs3 kernel: [<ffffffffc0521d29>] spl_panic+0xc9/0x110 [spl]
Nov 8 14:31:52 fs3 kernel: [<ffffffffc0b6276e>] ? arc_space_consume+0x5e/0x110 [zfs]
Nov 8 14:31:52 fs3 kernel: [<ffffffff83988632>] ? down_read+0x12/0x40
Nov 8 14:31:52 fs3 kernel: [<ffffffffc0b7b30f>] ? dbuf_read+0xff/0x570 [zfs]
Nov 8 14:31:52 fs3 kernel: [<ffffffffc0b825ee>] ? dmu_bonus_hold_by_dnode+0xde/0x1b0 [zfs]
Nov 8 14:31:52 fs3 kernel: [<ffffffffc05236b5>] ? spl_kmem_cache_free+0x185/0x210 [spl]
Nov 8 14:31:52 fs3 kernel: [<ffffffffc0b8f2da>] receive_object+0x64a/0xc80 [zfs]
Nov 8 14:31:52 fs3 kernel: [<ffffffff832d31f1>] ? __wake_up_common_lock+0x91/0xc0
Nov 8 14:31:52 fs3 kernel: [<ffffffff832d2c60>] ? task_rq_unlock+0x20/0x20
Nov 8 14:31:52 fs3 kernel: [<ffffffff835a67fd>] ? list_del+0xd/0x30
Nov 8 14:31:52 fs3 kernel: [<ffffffffc0b92904>] receive_writer_thread+0x4e4/0xa30 [zfs]
Nov 8 14:31:52 fs3 kernel: [<ffffffff832de305>] ? sched_clock_cpu+0x85/0xc0
Nov 8 14:31:52 fs3 kernel: [<ffffffffc0b92420>] ? receive_process_write_record+0x180/0x180 [zfs]
Nov 8 14:31:52 fs3 kernel: [<ffffffffc0b92420>] ? receive_process_write_record+0x180/0x180 [zfs]
Nov 8 14:31:52 fs3 kernel: [<ffffffffc0528873>] thread_generic_wrapper+0x73/0x80 [spl]
Nov 8 14:31:52 fs3 kernel: [<ffffffffc0528800>] ? __thread_exit+0x20/0x20 [spl]
Nov 8 14:31:52 fs3 kernel: [<ffffffff832c5e61>] kthread+0xd1/0xe0
Nov 8 14:31:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov 8 14:31:52 fs3 kernel: [<ffffffff83995ddd>] ret_from_fork_nospec_begin+0x7/0x21
Nov 8 14:31:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
<repeated many times>
Nov 8 14:31:54 fs3 zed: eid=33710 class=authentication pool='backup_pool' bookmark=176724:0:0:840
</repeated>
Nov 8 14:32:01 fs3 systemd: Started Session 1159 of user root.
Nov 8 14:33:01 fs3 systemd: Started Session 1160 of user root.
Nov 8 14:33:52 fs3 kernel: INFO: task txg_quiesce:8808 blocked for more than 120 seconds.
Nov 8 14:33:52 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 8 14:33:52 fs3 kernel: txg_quiesce D ffff8d4fe8f8e8e0 0 8808 2 0x00000000
Nov 8 14:33:52 fs3 kernel: Call Trace:
Nov 8 14:33:52 fs3 kernel: [<ffffffff83989179>] schedule+0x29/0x70
Nov 8 14:33:52 fs3 kernel: [<ffffffffc0521325>] cv_wait_common+0x125/0x150 [spl]
Nov 8 14:33:52 fs3 kernel: [<ffffffff832c6f50>] ? wake_up_atomic_t+0x30/0x30
Nov 8 14:33:52 fs3 kernel: [<ffffffffc0521365>] __cv_wait+0x15/0x20 [spl]
Nov 8 14:33:52 fs3 kernel: [<ffffffffc0c1293b>] txg_quiesce_thread+0x2db/0x3e0 [zfs]
Nov 8 14:33:52 fs3 kernel: [<ffffffffc0c12660>] ? txg_init+0x2b0/0x2b0 [zfs]
Nov 8 14:33:52 fs3 kernel: [<ffffffffc0528873>] thread_generic_wrapper+0x73/0x80 [spl]
Nov 8 14:33:52 fs3 kernel: [<ffffffffc0528800>] ? __thread_exit+0x20/0x20 [spl]
Nov 8 14:33:52 fs3 kernel: [<ffffffff832c5e61>] kthread+0xd1/0xe0
Nov 8 14:33:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov 8 14:33:52 fs3 kernel: [<ffffffff83995ddd>] ret_from_fork_nospec_begin+0x7/0x21
Nov 8 14:33:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov 8 14:33:52 fs3 kernel: INFO: task receive_writer:24727 blocked for more than 120 seconds.
Nov 8 14:33:52 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 8 14:33:52 fs3 kernel: receive_writer D ffff8d501bafb760 0 24727 2 0x00000080
Nov 8 14:33:52 fs3 kernel: Call Trace:
Nov 8 14:33:52 fs3 kernel: [<ffffffff83989179>] schedule+0x29/0x70
Nov 8 14:33:52 fs3 kernel: [<ffffffffc0521d55>] spl_panic+0xf5/0x110 [spl]
Nov 8 14:33:52 fs3 kernel: [<ffffffffc0b6276e>] ? arc_space_consume+0x5e/0x110 [zfs]
Nov 8 14:33:52 fs3 kernel: [<ffffffff83988632>] ? down_read+0x12/0x40
Nov 8 14:33:52 fs3 kernel: [<ffffffffc0b7b30f>] ? dbuf_read+0xff/0x570 [zfs]
Nov 8 14:33:52 fs3 kernel: [<ffffffffc0b825ee>] ? dmu_bonus_hold_by_dnode+0xde/0x1b0 [zfs]
Nov 8 14:33:52 fs3 kernel: [<ffffffffc05236b5>] ? spl_kmem_cache_free+0x185/0x210 [spl]
Nov 8 14:33:52 fs3 kernel: [<ffffffffc0b8f2da>] receive_object+0x64a/0xc80 [zfs]
Nov 8 14:33:52 fs3 kernel: [<ffffffff832d31f1>] ? __wake_up_common_lock+0x91/0xc0
Nov 8 14:33:52 fs3 kernel: [<ffffffff832d2c60>] ? task_rq_unlock+0x20/0x20
Nov 8 14:33:52 fs3 kernel: [<ffffffff835a67fd>] ? list_del+0xd/0x30
Nov 8 14:33:52 fs3 kernel: [<ffffffffc0b92904>] receive_writer_thread+0x4e4/0xa30 [zfs]
Nov 8 14:33:52 fs3 kernel: [<ffffffff832de305>] ? sched_clock_cpu+0x85/0xc0
Nov 8 14:33:52 fs3 kernel: [<ffffffffc0b92420>] ? receive_process_write_record+0x180/0x180 [zfs]
Nov 8 14:33:52 fs3 kernel: [<ffffffffc0b92420>] ? receive_process_write_record+0x180/0x180 [zfs]
Nov 8 14:33:52 fs3 kernel: [<ffffffffc0528873>] thread_generic_wrapper+0x73/0x80 [spl]
Nov 8 14:33:52 fs3 kernel: [<ffffffffc0528800>] ? __thread_exit+0x20/0x20 [spl]
Nov 8 14:33:52 fs3 kernel: [<ffffffff832c5e61>] kthread+0xd1/0xe0
Nov 8 14:33:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov 8 14:33:52 fs3 kernel: [<ffffffff83995ddd>] ret_from_fork_nospec_begin+0x7/0x21
Nov 8 14:33:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov 8 14:34:01 fs3 systemd: Started Session 1161 of user root.
Nov 8 14:35:01 fs3 systemd: Started Session 1162 of user root.
Nov 8 14:35:01 fs3 systemd: Started Session 1163 of user root.
Nov 8 14:35:52 fs3 kernel: INFO: task txg_quiesce:8808 blocked for more than 120 seconds.
Nov 8 14:35:52 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 8 14:35:52 fs3 kernel: txg_quiesce D ffff8d4fe8f8e8e0 0 8808 2 0x00000000
Nov 8 14:35:52 fs3 kernel: Call Trace:
Nov 8 14:35:52 fs3 kernel: [<ffffffff83989179>] schedule+0x29/0x70
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0521325>] cv_wait_common+0x125/0x150 [spl]
Nov 8 14:35:52 fs3 kernel: [<ffffffff832c6f50>] ? wake_up_atomic_t+0x30/0x30
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0521365>] __cv_wait+0x15/0x20 [spl]
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0c1293b>] txg_quiesce_thread+0x2db/0x3e0 [zfs]
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0c12660>] ? txg_init+0x2b0/0x2b0 [zfs]
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0528873>] thread_generic_wrapper+0x73/0x80 [spl]
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0528800>] ? __thread_exit+0x20/0x20 [spl]
Nov 8 14:35:52 fs3 kernel: [<ffffffff832c5e61>] kthread+0xd1/0xe0
Nov 8 14:35:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov 8 14:35:52 fs3 kernel: [<ffffffff83995ddd>] ret_from_fork_nospec_begin+0x7/0x21
Nov 8 14:35:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov 8 14:35:52 fs3 kernel: INFO: task receive_writer:24727 blocked for more than 120 seconds.
Nov 8 14:35:52 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 8 14:35:52 fs3 kernel: receive_writer D ffff8d501bafb760 0 24727 2 0x00000080
Nov 8 14:35:52 fs3 kernel: Call Trace:
Nov 8 14:35:52 fs3 kernel: [<ffffffff83989179>] schedule+0x29/0x70
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0521d55>] spl_panic+0xf5/0x110 [spl]
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0b6276e>] ? arc_space_consume+0x5e/0x110 [zfs]
Nov 8 14:35:52 fs3 kernel: [<ffffffff83988632>] ? down_read+0x12/0x40
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0b7b30f>] ? dbuf_read+0xff/0x570 [zfs]
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0b825ee>] ? dmu_bonus_hold_by_dnode+0xde/0x1b0 [zfs]
Nov 8 14:35:52 fs3 kernel: [<ffffffffc05236b5>] ? spl_kmem_cache_free+0x185/0x210 [spl]
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0b8f2da>] receive_object+0x64a/0xc80 [zfs]
Nov 8 14:35:52 fs3 kernel: [<ffffffff832d31f1>] ? __wake_up_common_lock+0x91/0xc0
Nov 8 14:35:52 fs3 kernel: [<ffffffff832d2c60>] ? task_rq_unlock+0x20/0x20
Nov 8 14:35:52 fs3 kernel: [<ffffffff835a67fd>] ? list_del+0xd/0x30
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0b92904>] receive_writer_thread+0x4e4/0xa30 [zfs]
Nov 8 14:35:52 fs3 kernel: [<ffffffff832de305>] ? sched_clock_cpu+0x85/0xc0
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0b92420>] ? receive_process_write_record+0x180/0x180 [zfs]
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0b92420>] ? receive_process_write_record+0x180/0x180 [zfs]
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0528873>] thread_generic_wrapper+0x73/0x80 [spl]
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0528800>] ? __thread_exit+0x20/0x20 [spl]
Nov 8 14:35:52 fs3 kernel: [<ffffffff832c5e61>] kthread+0xd1/0xe0
Nov 8 14:35:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov 8 14:35:52 fs3 kernel: [<ffffffff83995ddd>] ret_from_fork_nospec_begin+0x7/0x21
Nov 8 14:35:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov 8 14:35:52 fs3 kernel: INFO: task zfs:25014 blocked for more than 120 seconds.
Nov 8 14:35:52 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 8 14:35:52 fs3 kernel: zfs D ffff8d502d519660 0 25014 7344 0x00000080
Nov 8 14:35:52 fs3 kernel: Call Trace:
Nov 8 14:35:52 fs3 kernel: [<ffffffff83989179>] schedule+0x29/0x70
Nov 8 14:35:52 fs3 kernel: [<ffffffff83986e41>] schedule_timeout+0x221/0x2d0
Nov 8 14:35:52 fs3 kernel: [<ffffffffc05236cf>] ? spl_kmem_cache_free+0x19f/0x210 [spl]
Nov 8 14:35:52 fs3 kernel: [<ffffffffc05236cf>] ? spl_kmem_cache_free+0x19f/0x210 [spl]
Nov 8 14:35:52 fs3 kernel: [<ffffffff83988a2d>] io_schedule_timeout+0xad/0x130
Nov 8 14:35:52 fs3 kernel: [<ffffffff832c6ad6>] ? prepare_to_wait_exclusive+0x56/0x90
Nov 8 14:35:52 fs3 kernel: [<ffffffff83988ac8>] io_schedule+0x18/0x20
Nov 8 14:35:52 fs3 kernel: [<ffffffffc05212b2>] cv_wait_common+0xb2/0x150 [spl]
Nov 8 14:35:52 fs3 kernel: [<ffffffff832c6f50>] ? wake_up_atomic_t+0x30/0x30
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0521388>] __cv_wait_io+0x18/0x20 [spl]
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0c12155>] txg_wait_synced_impl+0xe5/0x130 [zfs]
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0c121b0>] txg_wait_synced+0x10/0x40 [zfs]
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0c691dc>] zil_close+0x17c/0x240 [zfs]
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0c88b61>] zfsvfs_teardown+0xc1/0x2e0 [zfs]
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0c88d90>] zfs_suspend_fs+0x10/0x20 [zfs]
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0c58189>] zfs_ioc_rollback+0xd9/0x180 [zfs]
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0c5b42b>] zfsdev_ioctl_common+0x51b/0x820 [zfs]
Nov 8 14:35:52 fs3 kernel: [<ffffffffc0c86ff6>] zfsdev_ioctl+0x56/0xf0 [zfs]
Nov 8 14:35:52 fs3 kernel: [<ffffffff83463ab0>] do_vfs_ioctl+0x3a0/0x5b0
Nov 8 14:35:52 fs3 kernel: [<ffffffff833fc0c7>] ? do_munmap+0x327/0x480
Nov 8 14:35:52 fs3 kernel: [<ffffffff83463d61>] SyS_ioctl+0xa1/0xc0
Nov 8 14:35:52 fs3 kernel: [<ffffffff83995f92>] system_call_fastpath+0x25/0x2a
Nov 8 14:36:01 fs3 systemd: Started Session 1165 of user root.
Nov 8 14:37:01 fs3 systemd: Started Session 1166 of user root.
Nov 8 14:37:52 fs3 kernel: INFO: task txg_quiesce:8808 blocked for more than 120 seconds.
Nov 8 14:37:52 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 8 14:37:52 fs3 kernel: txg_quiesce D ffff8d4fe8f8e8e0 0 8808 2 0x00000000
Nov 8 14:37:52 fs3 kernel: Call Trace:
Nov 8 14:37:52 fs3 kernel: [<ffffffff83989179>] schedule+0x29/0x70
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0521325>] cv_wait_common+0x125/0x150 [spl]
Nov 8 14:37:52 fs3 kernel: [<ffffffff832c6f50>] ? wake_up_atomic_t+0x30/0x30
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0521365>] __cv_wait+0x15/0x20 [spl]
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0c1293b>] txg_quiesce_thread+0x2db/0x3e0 [zfs]
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0c12660>] ? txg_init+0x2b0/0x2b0 [zfs]
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0528873>] thread_generic_wrapper+0x73/0x80 [spl]
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0528800>] ? __thread_exit+0x20/0x20 [spl]
Nov 8 14:37:52 fs3 kernel: [<ffffffff832c5e61>] kthread+0xd1/0xe0
Nov 8 14:37:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov 8 14:37:52 fs3 kernel: [<ffffffff83995ddd>] ret_from_fork_nospec_begin+0x7/0x21
Nov 8 14:37:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov 8 14:37:52 fs3 kernel: INFO: task receive_writer:24727 blocked for more than 120 seconds.
Nov 8 14:37:52 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 8 14:37:52 fs3 kernel: receive_writer D ffff8d501bafb760 0 24727 2 0x00000080
Nov 8 14:37:52 fs3 kernel: Call Trace:
Nov 8 14:37:52 fs3 kernel: [<ffffffff83989179>] schedule+0x29/0x70
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0521d55>] spl_panic+0xf5/0x110 [spl]
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0b6276e>] ? arc_space_consume+0x5e/0x110 [zfs]
Nov 8 14:37:52 fs3 kernel: [<ffffffff83988632>] ? down_read+0x12/0x40
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0b7b30f>] ? dbuf_read+0xff/0x570 [zfs]
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0b825ee>] ? dmu_bonus_hold_by_dnode+0xde/0x1b0 [zfs]
Nov 8 14:37:52 fs3 kernel: [<ffffffffc05236b5>] ? spl_kmem_cache_free+0x185/0x210 [spl]
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0b8f2da>] receive_object+0x64a/0xc80 [zfs]
Nov 8 14:37:52 fs3 kernel: [<ffffffff832d31f1>] ? __wake_up_common_lock+0x91/0xc0
Nov 8 14:37:52 fs3 kernel: [<ffffffff832d2c60>] ? task_rq_unlock+0x20/0x20
Nov 8 14:37:52 fs3 kernel: [<ffffffff835a67fd>] ? list_del+0xd/0x30
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0b92904>] receive_writer_thread+0x4e4/0xa30 [zfs]
Nov 8 14:37:52 fs3 kernel: [<ffffffff832de305>] ? sched_clock_cpu+0x85/0xc0
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0b92420>] ? receive_process_write_record+0x180/0x180 [zfs]
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0b92420>] ? receive_process_write_record+0x180/0x180 [zfs]
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0528873>] thread_generic_wrapper+0x73/0x80 [spl]
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0528800>] ? __thread_exit+0x20/0x20 [spl]
Nov 8 14:37:52 fs3 kernel: [<ffffffff832c5e61>] kthread+0xd1/0xe0
Nov 8 14:37:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov 8 14:37:52 fs3 kernel: [<ffffffff83995ddd>] ret_from_fork_nospec_begin+0x7/0x21
Nov 8 14:37:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov 8 14:37:52 fs3 kernel: INFO: task zfs:25014 blocked for more than 120 seconds.
Nov 8 14:37:52 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 8 14:37:52 fs3 kernel: zfs D ffff8d502d519660 0 25014 7344 0x00000080
Nov 8 14:37:52 fs3 kernel: Call Trace:
Nov 8 14:37:52 fs3 kernel: [<ffffffff83989179>] schedule+0x29/0x70
Nov 8 14:37:52 fs3 kernel: [<ffffffff83986e41>] schedule_timeout+0x221/0x2d0
Nov 8 14:37:52 fs3 kernel: [<ffffffffc05236cf>] ? spl_kmem_cache_free+0x19f/0x210 [spl]
Nov 8 14:37:52 fs3 kernel: [<ffffffffc05236cf>] ? spl_kmem_cache_free+0x19f/0x210 [spl]
Nov 8 14:37:52 fs3 kernel: [<ffffffff83988a2d>] io_schedule_timeout+0xad/0x130
Nov 8 14:37:52 fs3 kernel: [<ffffffff832c6ad6>] ? prepare_to_wait_exclusive+0x56/0x90
Nov 8 14:37:52 fs3 kernel: [<ffffffff83988ac8>] io_schedule+0x18/0x20
Nov 8 14:37:52 fs3 kernel: [<ffffffffc05212b2>] cv_wait_common+0xb2/0x150 [spl]
Nov 8 14:37:52 fs3 kernel: [<ffffffff832c6f50>] ? wake_up_atomic_t+0x30/0x30
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0521388>] __cv_wait_io+0x18/0x20 [spl]
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0c12155>] txg_wait_synced_impl+0xe5/0x130 [zfs]
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0c121b0>] txg_wait_synced+0x10/0x40 [zfs]
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0c691dc>] zil_close+0x17c/0x240 [zfs]
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0c88b61>] zfsvfs_teardown+0xc1/0x2e0 [zfs]
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0c88d90>] zfs_suspend_fs+0x10/0x20 [zfs]
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0c58189>] zfs_ioc_rollback+0xd9/0x180 [zfs]
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0c5b42b>] zfsdev_ioctl_common+0x51b/0x820 [zfs]
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0c86ff6>] zfsdev_ioctl+0x56/0xf0 [zfs]
Nov 8 14:37:52 fs3 kernel: [<ffffffff83463ab0>] do_vfs_ioctl+0x3a0/0x5b0
Nov 8 14:37:52 fs3 kernel: [<ffffffff833fc0c7>] ? do_munmap+0x327/0x480
Nov 8 14:37:52 fs3 kernel: [<ffffffff83463d61>] SyS_ioctl+0xa1/0xc0
Nov 8 14:37:52 fs3 kernel: [<ffffffff83995f92>] system_call_fastpath+0x25/0x2a
Nov 8 14:37:52 fs3 kernel: INFO: task sshd:221387 blocked for more than 120 seconds.
Nov 8 14:37:52 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 8 14:37:52 fs3 kernel: sshd D ffff8d50007726e0 0 221387 11295 0x00000080
Nov 8 14:37:52 fs3 kernel: Call Trace:
Nov 8 14:37:52 fs3 kernel: [<ffffffff83989179>] schedule+0x29/0x70
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0521325>] cv_wait_common+0x125/0x150 [spl]
Nov 8 14:37:52 fs3 kernel: [<ffffffff832c6f50>] ? wake_up_atomic_t+0x30/0x30
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0521365>] __cv_wait+0x15/0x20 [spl]
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0beaef3>] rrw_enter_read_impl+0x53/0x170 [zfs]
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0beb59f>] rrm_enter_read+0x3f/0x50 [zfs]
Nov 8 14:37:52 fs3 kernel: [<ffffffffc0ca1454>] zpl_show_devname+0x34/0x100 [zfs]
Nov 8 14:37:52 fs3 kernel: [<ffffffff83495f46>] show_mountinfo+0x1d6/0x2d0
Nov 8 14:37:52 fs3 kernel: [<ffffffff834707b9>] m_show+0x19/0x20
Nov 8 14:37:52 fs3 kernel: [<ffffffff83476cf0>] seq_read+0x130/0x450
Nov 8 14:37:52 fs3 kernel: [<ffffffff8344e3df>] vfs_read+0x9f/0x170
Nov 8 14:37:52 fs3 kernel: [<ffffffff8344f25f>] SyS_read+0x7f/0xf0
Nov 8 14:37:52 fs3 kernel: [<ffffffff83995f92>] system_call_fastpath+0x25/0x2a
Nov 8 14:38:01 fs3 systemd: Started Session 1167 of user root.
Nov 8 14:39:01 fs3 systemd: Started Session 1170 of user root.
Nov 8 14:39:52 fs3 kernel: INFO: task txg_quiesce:8808 blocked for more than 120 seconds.
Nov 8 14:39:52 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 8 14:39:52 fs3 kernel: txg_quiesce D ffff8d4fe8f8e8e0 0 8808 2 0x00000000
Nov 8 14:39:52 fs3 kernel: Call Trace:
Nov 8 14:39:52 fs3 kernel: [<ffffffff83989179>] schedule+0x29/0x70
Nov 8 14:39:52 fs3 kernel: [<ffffffffc0521325>] cv_wait_common+0x125/0x150 [spl]
Nov 8 14:39:52 fs3 kernel: [<ffffffff832c6f50>] ? wake_up_atomic_t+0x30/0x30
Nov 8 14:39:52 fs3 kernel: [<ffffffffc0521365>] __cv_wait+0x15/0x20 [spl]
Nov 8 14:39:52 fs3 kernel: [<ffffffffc0c1293b>] txg_quiesce_thread+0x2db/0x3e0 [zfs]
Nov 8 14:39:52 fs3 kernel: [<ffffffffc0c12660>] ? txg_init+0x2b0/0x2b0 [zfs]
Nov 8 14:39:52 fs3 kernel: [<ffffffffc0528873>] thread_generic_wrapper+0x73/0x80 [spl]
Nov 8 14:39:52 fs3 kernel: [<ffffffffc0528800>] ? __thread_exit+0x20/0x20 [spl]
Nov 8 14:39:52 fs3 kernel: [<ffffffff832c5e61>] kthread+0xd1/0xe0
Nov 8 14:39:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov 8 14:39:52 fs3 kernel: [<ffffffff83995ddd>] ret_from_fork_nospec_begin+0x7/0x21
Nov 8 14:39:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
At this point, zfs sends appear to still be in operation (the remote target continues to receive data), but anything involving writes is hung.
# zfs iostat 1
capacity operations bandwidth
pool alloc free read write read write
----------- ----- ----- ----- ----- ----- -----
backup_pool 388T 266T 6.02K 0 101M 0
backup_pool 388T 266T 6.18K 0 99.0M 0
backup_pool 388T 266T 6.94K 0 114M 0
backup_pool 388T 266T 5.42K 0 92.9M 0
Describe how to reproduce the problem
Use sanoid/syncoid to generate and transfer (push and pull) snapshots between systems. Wait...
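A rough sketch of the kind of configuration involved, for anyone trying to reproduce this (dataset names and retention values below are illustrative only, not the production config):

# /etc/sanoid/sanoid.conf -- drives the snapshot create/destroy churn
[backup_pool/hosts]
        use_template = backup
        recursive = yes

[template_backup]
        autosnap = yes
        autoprune = yes
        hourly = 24
        daily = 30
        monthly = 6

sanoid --cron then runs from a frequent cron job, while syncoid (with the flags shown above) pushes/pulls the snapshots between hosts; the multi-TB transfers combined with the concurrent snapshot create/destroy churn seem to be the key ingredients.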
Include any warning/errors/backtraces from the system logs
I think this is just another flavor of #11679, where a NULL winds up where a valid buffer is expected, and fireworks ensue.
Update on this: I'm now seeing this at least once a day, and it has also occurred on a host that receives from this host. As it stands, the backup host is effectively non-functional.
Further update
After running two complete scrubs of the pool, the corrupted-data errors went away and we were able to resume transfers. As a test, Sanoid was disabled so that no snapshots should be destroyed on a file system while it was being modified by a receive, and things were much more stable. However, leaving the large (~20TB) transfer running over the weekend, I returned this morning to find it hung up again, but this time with a different error:
Nov 20 12:41:15 fs3 kernel: Modules linked in: 8021q garp mrp bonding ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack ip_set ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter drbg ansi_cprng dm_crypt iTCO_wdt iTCO_vendor_support skx_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr ses enclosure dm_service_time ipmi_ssif joydev lpc_ich sg i2c_i801 mei_me mei wmi ipmi_si ipmi_devintf
Nov 20 12:41:15 fs3 kernel: ipmi_msghandler acpi_power_meter acpi_pad dm_multipath dm_mod binfmt_misc ip_tables xfs libcrc32c zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) raid1 sd_mod crc_t10dif crct10dif_generic ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mpt3sas drm crct10dif_pclmul ahci crct10dif_common i40e crc32c_intel libahci libata raid_class scsi_transport_sas ptp pps_core nfit drm_panel_orientation_quirks libnvdimm
Nov 20 12:41:15 fs3 kernel: CPU: 5 PID: 49866 Comm: zfs Kdump: loaded Tainted: P OE ------------ 3.10.0-1160.42.2.el7.x86_64 #1
Nov 20 12:41:15 fs3 kernel: Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 2.1 06/14/2018
Nov 20 12:41:15 fs3 kernel: task: ffff9209afde2100 ti: ffff9201d797c000 task.ti: ffff9201d797c000
Nov 20 12:41:15 fs3 kernel: RIP: 0010:[
Data errors are back (on different file systems), so I have some hope that scrubs will clear these.
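For the record, the recovery sequence that cleared the errors last time was essentially the following (two full passes were needed before the error list emptied):

# zpool scrub backup_pool
# zpool status backup_pool        <- wait for the scrub to complete, then repeat
# zpool scrub backup_pool
# zpool status -v backup_pool     <- error list was empty again after the second pass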
Is there an error message above the "Modules linked in: ..."? Because I would usually expect one, I think.
...and a bit later on in the logs...
Nov 20 17:47:31 fs3 kernel: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [zfs:49866]
Nov 20 17:47:31 fs3 kernel: Modules linked in: 8021q garp mrp bonding ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack ip_set ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter drbg ansi_cprng dm_crypt iTCO_wdt iTCO_vendor_support skx_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr ses enclosure dm_service_time ipmi_ssif joydev lpc_ich sg i2c_i801 mei_me mei wmi ipmi_si ipmi_devintf
Nov 20 17:47:31 fs3 kernel: ipmi_msghandler acpi_power_meter acpi_pad dm_multipath dm_mod binfmt_misc ip_tables xfs libcrc32c zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) raid1 sd_mod crc_t10dif crct10dif_generic ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mpt3sas drm crct10dif_pclmul ahci crct10dif_common i40e crc32c_intel libahci libata raid_class scsi_transport_sas ptp pps_core nfit drm_panel_orientation_quirks libnvdimm
Nov 20 17:47:31 fs3 kernel: CPU: 1 PID: 49866 Comm: zfs Kdump: loaded Tainted: P OEL ------------ 3.10.0-1160.42.2.el7.x86_64 #1
Nov 20 17:47:31 fs3 kernel: Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 2.1 06/14/2018
Nov 20 17:47:31 fs3 kernel: task: ffff9209afde2100 ti: ffff9201d797c000 task.ti: ffff9201d797c000
Nov 20 17:47:31 fs3 kernel: RIP: 0010:[<ffffffff8e78b795>] [<ffffffff8e78b795>] _raw_spin_unlock_irqrestore+0x15/0x20
Nov 20 17:47:31 fs3 kernel: RSP: 0018:ffff9201d797f8d8 EFLAGS: 00000246
Nov 20 17:47:31 fs3 kernel: RAX: 0000000000000246 RBX: 0000000000000001 RCX: dead000000000200
Nov 20 17:47:31 fs3 kernel: RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
Nov 20 17:47:31 fs3 kernel: RBP: ffff9201d797f8d8 R08: ffff9201d797f900 R09: 0000000000000001
Nov 20 17:47:31 fs3 kernel: R10: 0000000000000001 R11: 0000000000000000 R12: ffff91fe822abe80
Nov 20 17:47:31 fs3 kernel: R13: ffff9201d797f8d8 R14: ffff91fe822abe80 R15: 0000000000000000
Nov 20 17:47:31 fs3 kernel: FS: 00007fd2bdd167c0(0000) GS:ffff920afbc40000(0000) knlGS:0000000000000000
Nov 20 17:47:31 fs3 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 20 17:47:31 fs3 kernel: CR2: 00007f16d72c2000 CR3: 00000009399cc000 CR4: 00000000007607e0
Nov 20 17:47:31 fs3 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 20 17:47:31 fs3 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 20 17:47:31 fs3 kernel: PKRU: 55555554
Nov 20 17:47:31 fs3 kernel: Call Trace:
Nov 20 17:47:31 fs3 kernel: [<ffffffffc02ccc3d>] taskq_wait_outstanding+0x4d/0xf0 [spl]
Nov 20 17:47:31 fs3 kernel: [<ffffffffc0c85af9>] zfsvfs_teardown+0x59/0x2e0 [zfs]
Nov 20 17:47:31 fs3 kernel: [<ffffffffc0c86419>] zfs_umount+0x39/0x120 [zfs]
Nov 20 17:47:31 fs3 kernel: [<ffffffffc0c88fd6>] zfs_resume_fs+0x106/0x340 [zfs]
Nov 20 17:47:31 fs3 kernel: [<ffffffffc0c562e7>] zfs_ioc_recv_impl+0xa57/0xfd0 [zfs]
Nov 20 17:47:31 fs3 kernel: [<ffffffffc0b785e6>] ? dbuf_read+0x3d6/0x570 [zfs]
Nov 20 17:47:31 fs3 kernel: [<ffffffffc0c56dd4>] zfs_ioc_recv_new+0x2b4/0x330 [zfs]
Nov 20 17:47:31 fs3 kernel: [<ffffffffc02ceb59>] ? spl_vmem_alloc+0x19/0x20 [spl]
Nov 20 17:47:31 fs3 kernel: [<ffffffffc02e933f>] ? nv_alloc_sleep_spl+0x1f/0x30 [znvpair]
Nov 20 17:47:31 fs3 kernel: [<ffffffffc02e3760>] ? nv_mem_zalloc.isra.13+0x30/0x40 [znvpair]
Nov 20 17:47:31 fs3 kernel: [<ffffffffc02e3a17>] ? nvlist_xalloc.part.14+0x97/0x190 [znvpair]
Nov 20 17:47:31 fs3 kernel: [<ffffffffc0c5842b>] zfsdev_ioctl_common+0x51b/0x820 [zfs]
Nov 20 17:47:31 fs3 kernel: [<ffffffffc0c83ff6>] zfsdev_ioctl+0x56/0xf0 [zfs]
Nov 20 17:47:31 fs3 kernel: [<ffffffff8e263ab0>] do_vfs_ioctl+0x3a0/0x5b0
Nov 20 17:47:31 fs3 kernel: [<ffffffff8e790678>] ? __do_page_fault+0x238/0x500
Nov 20 17:47:31 fs3 kernel: [<ffffffff8e263d61>] SyS_ioctl+0xa1/0xc0
Nov 20 17:47:31 fs3 kernel: [<ffffffff8e795f92>] system_call_fastpath+0x25/0x2a
Nov 20 17:47:31 fs3 kernel: Code: 07 00 0f 1f 40 00 5d c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 c6 07 00 0f 1f 40 00 48 89 f7 57 9d <0f> 1f 44 00 00 5d c3 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 48
In answer to your question, no, it just looks like the zfs command that was being run at the time froze up. I had to trigger a hardware reset as a shutdown -r now did very little.
Correction: there was an NMI watchdog line; I must have failed to copy it the first time. It's for the same process ID as the second call trace.
Huh. I wonder what state is wrong that it's stuck forever on, because I don't think I expect ordinary "blocked taskq" to trigger one of those.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
Bumping this as this is something I've been running into with exactly the same symptoms as above. This essentially makes native zfs encryption completely unusable for me.
Also, there's another somewhat similar issue: sending from an encrypted fs also leads to seemingly recoverable "permanent data errors" on intermediate from-snapshots. It doesn't lead to any kernel errors, and recovering from it is simply a matter of deleting the offending snapshot.
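(For completeness, recovery in that case is just something along these lines, with the snapshot name taken from whatever zpool status -v reports; dataset names here are placeholders:

# zpool status -v tank
# zfs destroy tank/some/fs@offending-snapshot

the error entries themselves may linger as <0x...> references until a scrub or two completes.)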
You might find the fix for #13709 useful for the latter, w23, because I don't know what else might break those.
Thanks for the link. After reading it a bit I'm not sure if that's relevant to issues I'm seeing. I.e. the pools and encrypted filesystems get unlocked and mounted just fine. It's send/recv that has issues.
It feels like it's some sort of pool-wide race condition, as doing more ops to completely different filesystem subtrees (i.e. unrelated by nesting, and having different keys) severely exacerbates the problem. These ops include stuff like:
- copying big files to a different filesystem
- just running zfs list on snapshots of an unrelated filesystem
Transfer speed also affects the probability of hitting it. It can be fine for weeks if transfers are around a few hundred KiB/sec. If it gets to a few dozen MiB/sec, it usually panics within a few minutes.
I contemplated making a standalone test, e.g. a minimal qemu image with a script doing zfs send|zfs recv in a loop, to get a clear and consistent reproduction scenario, but so far haven't been able to find time to prepare it.
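In case it helps anyone get there first, the core of such a test would presumably look something like the following (untested sketch; pool, device, and dataset names are placeholders, and the key handling is simplified):

# create a throwaway pool with an encrypted receive target
zpool create testpool /dev/vdb
dd if=/dev/urandom of=/root/test.key bs=32 count=1
zfs create -o encryption=on -o keyformat=raw -o keylocation=file:///root/test.key testpool/dst
zfs create testpool/src

# hammer the encrypted side with incremental receives while churning snapshots
i=0
while true; do
    i=$((i+1))
    dd if=/dev/urandom of=/testpool/src/f$i bs=1M count=64 status=none
    zfs snapshot testpool/src@s$i
    if [ "$i" -eq 1 ]; then
        zfs send testpool/src@s1 | zfs recv testpool/dst/copy
    else
        zfs send -i @s$((i-1)) testpool/src@s$i | zfs recv testpool/dst/copy
    fi
    # prune older snapshots on the receiving side in the background,
    # mimicking Sanoid's create/destroy churn during a receive
    zfs destroy testpool/dst/copy@s$((i-2)) 2>/dev/null &
done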
I was talking about the intermediate snapshots failing to unlock, as I was remarking that it might be that problem.