
ZFS 0.8: kworker sucking up 100% CPU, various messages in logs mentioning ZFS, system eventually hangs.

Open • ReimuHakurei opened this issue 5 years ago • 33 comments

System information

Type Version/Name
Distribution Name Ubuntu Server
Distribution Version 19.10 Eoan Ermine
Linux Kernel 5.3.0-13-generic #14-Ubuntu SMP Tue Sep 24 02:46:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Architecture amd64
ZFS Version 0.8.1-1ubuntu11

Describe the problem you're observing

After some hours of system uptime, a kworker process will suck up 100% of a core. The system load averages will climb constantly. This kworker process appears to be ZFS related (as of this writing, the current CPU hog is named kworker/u33:3+flush-zfs-19).

This system was originally on Ubuntu Server 18.04 with ZFS 0.8.1 that I had compiled some time before. It had been up for several months with no issues; then, out of nowhere, this issue came up.

Things that have been tried:

  • Rebooting several times.
  • Upgrading to the Ubuntu HWE kernel.
  • Updating to the development release of Ubuntu 19.10 (which comes with ZFS 0.8) to rule out problems with the version I compiled.

I've tried everything I can think of, and since various ZFS-related terms come up in the kworker name and messages printed to dmesg, I am left to assume it is a problem in ZFS.

Describe how to reproduce the problem

Unknown. The system this is occurring on has been stable, with minimal changes, for about a year. The system hosts various LXC containers. The root filesystem is ZFS, on a single 500GB Crucial SSD. All data for the LXC containers is stored on an encrypted storage pool. Output of zpool status:

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:02:31 with 0 errors on Mon Oct  7 22:35:32 2019
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          sda1      ONLINE       0     0     0

errors: No known data errors

  pool: storage
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:55:30 with 0 errors on Mon Oct  7 23:35:51 2019
config:

        NAME                        STATE     READ WRITE CKSUM
        storage                     ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            scsi-35000cca24300d8f4  ONLINE       0     0     0
            scsi-35000cca243014464  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            scsi-35000cca243015924  ONLINE       0     0     0
            scsi-35000cca24400a9a4  ONLINE       0     0     0

errors: No known data errors

Include any warning/errors/backtraces from the system logs

Not sure what would be helpful for debugging, but here's some stuff that was printed to dmesg:

[47488.869573] INFO: task pip:19670 blocked for more than 120 seconds.
[47488.869627]       Tainted: P           O      5.3.0-13-generic #14-Ubuntu
[47488.869670] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[47488.869718] pip             D    0 19670  19606 0x00000120
[47488.869722] Call Trace:
[47488.869732]  __schedule+0x2b9/0x6c0
[47488.869736]  ? bit_wait_io+0x50/0x50
[47488.869738]  schedule+0x42/0xb0
[47488.869741]  bit_wait+0x11/0x50
[47488.869743]  __wait_on_bit+0x33/0xa0
[47488.869749]  __inode_wait_for_writeback+0xae/0xe0
[47488.869753]  ? var_wake_function+0x30/0x30
[47488.869756]  inode_wait_for_writeback+0x27/0x40
[47488.869759]  evict+0xbd/0x1b0
[47488.869761]  iput+0x141/0x1f0
[47488.869764]  do_unlinkat+0x1c5/0x2d0
[47488.869767]  __x64_sys_unlinkat+0x38/0x60
[47488.869770]  do_syscall_64+0x5a/0x130
[47488.869774]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[47488.869777] RIP: 0033:0x7f7a4ac39d77
[47488.869784] Code: Bad RIP value.
[47488.869786] RSP: 002b:00007fff4c816858 EFLAGS: 00000246 ORIG_RAX: 0000000000000107
[47488.869788] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f7a4ac39d77
[47488.869790] RDX: 0000000000000000 RSI: 00007f7a44c4bda0 RDI: 0000000000000005
[47488.869791] RBP: 0000555a9b59b240 R08: 0000000000000000 R09: 00007fff4c816510
[47488.869792] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f7a4aad01b0
[47488.869793] R13: 00007f7a4aac8e08 R14: 0000555a9a878240 R15: 0000555a9b9eb460
[47488.869797] INFO: task sync:20009 blocked for more than 120 seconds.
[47488.869840]       Tainted: P           O      5.3.0-13-generic #14-Ubuntu
[47488.869883] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[47488.869933] sync            D    0 20009  20008 0x00004120
[47488.869936] Call Trace:
[47488.869940]  __schedule+0x2b9/0x6c0
[47488.869943]  schedule+0x42/0xb0
[47488.869945]  wb_wait_for_completion+0x64/0x90
[47488.869948]  ? wait_woken+0x80/0x80
[47488.869950]  sync_inodes_sb+0xd7/0x290
[47488.869954]  sync_inodes_one_sb+0x15/0x20
[47488.869957]  iterate_supers+0xa3/0x110
[47488.869959]  ? page_cache_pipe_buf_steal.cold+0x1c/0x1c
[47488.869962]  ksys_sync+0x42/0xb0
[47488.869964]  __ia32_sys_sync+0xe/0x20
[47488.869967]  do_syscall_64+0x5a/0x130
[47488.869970]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[47488.869972] RIP: 0033:0x7f56f6bd7287
[47488.869976] Code: Bad RIP value.
[47488.869977] RSP: 002b:00007fff2c486078 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
[47488.869979] RAX: ffffffffffffffda RBX: 00007fff2c4861a8 RCX: 00007f56f6bd7287
[47488.869981] RDX: 00007f56f6eae001 RSI: 0000000000000000 RDI: 00007f56f6c754b0
[47488.869982] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[47488.869984] R10: 000055d597318010 R11: 0000000000000246 R12: 0000000000000001
[47488.869985] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[47609.702068] INFO: task pip:19670 blocked for more than 241 seconds.
[47609.702114]       Tainted: P           O      5.3.0-13-generic #14-Ubuntu
[47609.702150] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[47609.702191] pip             D    0 19670  19606 0x00000120
[47609.702194] Call Trace:
[47609.702203]  __schedule+0x2b9/0x6c0
[47609.702206]  ? bit_wait_io+0x50/0x50
[47609.702209]  schedule+0x42/0xb0
[47609.702210]  bit_wait+0x11/0x50
[47609.702212]  __wait_on_bit+0x33/0xa0
[47609.702218]  __inode_wait_for_writeback+0xae/0xe0
[47609.702221]  ? var_wake_function+0x30/0x30
[47609.702223]  inode_wait_for_writeback+0x27/0x40
[47609.702226]  evict+0xbd/0x1b0
[47609.702227]  iput+0x141/0x1f0
[47609.702230]  do_unlinkat+0x1c5/0x2d0
[47609.702232]  __x64_sys_unlinkat+0x38/0x60
[47609.702235]  do_syscall_64+0x5a/0x130
[47609.702238]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[47609.702240] RIP: 0033:0x7f7a4ac39d77
[47609.702247] Code: Bad RIP value.
[47609.702248] RSP: 002b:00007fff4c816858 EFLAGS: 00000246 ORIG_RAX: 0000000000000107
[47609.702251] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f7a4ac39d77
[47609.702252] RDX: 0000000000000000 RSI: 00007f7a44c4bda0 RDI: 0000000000000005
[47609.702253] RBP: 0000555a9b59b240 R08: 0000000000000000 R09: 00007fff4c816510
[47609.702254] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f7a4aad01b0
[47609.702255] R13: 00007f7a4aac8e08 R14: 0000555a9a878240 R15: 0000555a9b9eb460
[47609.702258] INFO: task sync:20009 blocked for more than 241 seconds.
[47609.702293]       Tainted: P           O      5.3.0-13-generic #14-Ubuntu
[47609.702328] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[47609.702368] sync            D    0 20009  20008 0x00004120
[47609.702370] Call Trace:
[47609.702373]  __schedule+0x2b9/0x6c0
[47609.702376]  schedule+0x42/0xb0
[47609.702378]  wb_wait_for_completion+0x64/0x90
[47609.702379]  ? wait_woken+0x80/0x80
[47609.702381]  sync_inodes_sb+0xd7/0x290
[47609.702384]  sync_inodes_one_sb+0x15/0x20
[47609.702386]  iterate_supers+0xa3/0x110
[47609.702388]  ? page_cache_pipe_buf_steal.cold+0x1c/0x1c
[47609.702391]  ksys_sync+0x42/0xb0
[47609.702393]  __ia32_sys_sync+0xe/0x20
[47609.702395]  do_syscall_64+0x5a/0x130
[47609.702398]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[47609.702399] RIP: 0033:0x7f56f6bd7287
[47609.702402] Code: Bad RIP value.
[47609.702403] RSP: 002b:00007fff2c486078 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
[47609.702405] RAX: ffffffffffffffda RBX: 00007fff2c4861a8 RCX: 00007f56f6bd7287
[47609.702406] RDX: 00007f56f6eae001 RSI: 0000000000000000 RDI: 00007f56f6c754b0
[47609.702407] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[47609.702408] R10: 000055d597318010 R11: 0000000000000246 R12: 0000000000000001
[47609.702409] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[47730.534430] INFO: task journal-offline:32058 blocked for more than 120 seconds.
[47730.534481]       Tainted: P           O      5.3.0-13-generic #14-Ubuntu
[47730.534516] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[47730.534558] journal-offline D    0 32058  27128 0x00000120
[47730.534561] Call Trace:
[47730.534570]  __schedule+0x2b9/0x6c0
[47730.534573]  schedule+0x42/0xb0
[47730.534575]  schedule_preempt_disabled+0xe/0x10
[47730.534578]  __mutex_lock.isra.0+0x182/0x4f0
[47730.534581]  __mutex_lock_slowpath+0x13/0x20
[47730.534583]  mutex_lock+0x2e/0x40
[47730.534698]  zil_commit_impl+0x148/0xd60 [zfs]
[47730.534766]  ? zpl_write_common_iovec+0x100/0x100 [zfs]
[47730.534769]  ? _cond_resched+0x19/0x30
[47730.534771]  ? mutex_lock+0x13/0x40
[47730.534840]  zil_commit+0x40/0x60 [zfs]
[47730.534908]  zpl_writepages+0xec/0x180 [zfs]
[47730.534912]  do_writepages+0x43/0xd0
[47730.534916]  __filemap_fdatawrite_range+0xc4/0x100
[47730.534919]  filemap_write_and_wait_range+0x36/0x90
[47730.534987]  zpl_fsync+0x3d/0xa0 [zfs]
[47730.534991]  vfs_fsync_range+0x49/0x80
[47730.534993]  ? __fget_light+0x57/0x70
[47730.534995]  do_fsync+0x3d/0x70
[47730.534997]  __x64_sys_fsync+0x14/0x20
[47730.535001]  do_syscall_64+0x5a/0x130
[47730.535003]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[47730.535006] RIP: 0033:0x7f872e0af237
[47730.535012] Code: Bad RIP value.
[47730.535014] RSP: 002b:00007f8729766cb0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[47730.535016] RAX: ffffffffffffffda RBX: 0000000000000014 RCX: 00007f872e0af237
[47730.535017] RDX: 0000000000000000 RSI: 00007f872dcb762c RDI: 0000000000000014
[47730.535018] RBP: 00007f872dcb8da0 R08: 00007f8729767700 R09: 00007f8729767700
[47730.535019] R10: 000000000000062c R11: 0000000000000293 R12: 0000000000000002
[47730.535020] R13: 0000000000000000 R14: 00005647819af200 R15: 00007ffce0398f10
[47730.535120] INFO: task pip:19670 blocked for more than 362 seconds.
[47730.535155]       Tainted: P           O      5.3.0-13-generic #14-Ubuntu
[47730.535190] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[47730.535230] pip             D    0 19670  19606 0x00000120
[47730.535232] Call Trace:
[47730.535235]  __schedule+0x2b9/0x6c0
[47730.535238]  ? bit_wait_io+0x50/0x50
[47730.535240]  schedule+0x42/0xb0
[47730.535242]  bit_wait+0x11/0x50
[47730.535243]  __wait_on_bit+0x33/0xa0
[47730.535246]  __inode_wait_for_writeback+0xae/0xe0
[47730.535249]  ? var_wake_function+0x30/0x30
[47730.535252]  inode_wait_for_writeback+0x27/0x40
[47730.535254]  evict+0xbd/0x1b0
[47730.535256]  iput+0x141/0x1f0
[47730.535258]  do_unlinkat+0x1c5/0x2d0
[47730.535260]  __x64_sys_unlinkat+0x38/0x60
[47730.535262]  do_syscall_64+0x5a/0x130
[47730.535265]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[47730.535266] RIP: 0033:0x7f7a4ac39d77
[47730.535270] Code: Bad RIP value.
[47730.535271] RSP: 002b:00007fff4c816858 EFLAGS: 00000246 ORIG_RAX: 0000000000000107
[47730.535273] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f7a4ac39d77
[47730.535274] RDX: 0000000000000000 RSI: 00007f7a44c4bda0 RDI: 0000000000000005
[47730.535275] RBP: 0000555a9b59b240 R08: 0000000000000000 R09: 00007fff4c816510
[47730.535276] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f7a4aad01b0
[47730.535277] R13: 00007f7a4aac8e08 R14: 0000555a9a878240 R15: 0000555a9b9eb460
[47730.535280] INFO: task sync:20009 blocked for more than 362 seconds.
[47730.535314]       Tainted: P           O      5.3.0-13-generic #14-Ubuntu
[47730.535349] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[47730.535389] sync            D    0 20009  20008 0x00004120
[47730.535391] Call Trace:
[47730.535394]  __schedule+0x2b9/0x6c0
[47730.535396]  schedule+0x42/0xb0
[47730.535398]  wb_wait_for_completion+0x64/0x90
[47730.535400]  ? wait_woken+0x80/0x80
[47730.535402]  sync_inodes_sb+0xd7/0x290
[47730.535405]  sync_inodes_one_sb+0x15/0x20
[47730.535407]  iterate_supers+0xa3/0x110
[47730.535409]  ? page_cache_pipe_buf_steal.cold+0x1c/0x1c
[47730.535411]  ksys_sync+0x42/0xb0
[47730.535413]  __ia32_sys_sync+0xe/0x20
[47730.535416]  do_syscall_64+0x5a/0x130
[47730.535418]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[47730.535420] RIP: 0033:0x7f56f6bd7287
[47730.535423] Code: Bad RIP value.
[47730.535424] RSP: 002b:00007fff2c486078 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
[47730.535425] RAX: ffffffffffffffda RBX: 00007fff2c4861a8 RCX: 00007f56f6bd7287
[47730.535427] RDX: 00007f56f6eae001 RSI: 0000000000000000 RDI: 00007f56f6c754b0
[47730.535428] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[47730.535429] R10: 000055d597318010 R11: 0000000000000246 R12: 0000000000000001
[47730.535429] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[47851.366840] INFO: task journal-offline:32058 blocked for more than 241 seconds.
[47851.366900]       Tainted: P           O      5.3.0-13-generic #14-Ubuntu
[47851.366944] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[47851.366994] journal-offline D    0 32058  27128 0x00000124
[47851.366998] Call Trace:
[47851.367007]  __schedule+0x2b9/0x6c0
[47851.367011]  schedule+0x42/0xb0
[47851.367014]  schedule_preempt_disabled+0xe/0x10
[47851.367017]  __mutex_lock.isra.0+0x182/0x4f0
[47851.367021]  __mutex_lock_slowpath+0x13/0x20
[47851.367023]  mutex_lock+0x2e/0x40
[47851.367145]  zil_commit_impl+0x148/0xd60 [zfs]
[47851.367224]  ? zpl_write_common_iovec+0x100/0x100 [zfs]
[47851.367227]  ? _cond_resched+0x19/0x30
[47851.367230]  ? mutex_lock+0x13/0x40
[47851.367309]  zil_commit+0x40/0x60 [zfs]
[47851.367387]  zpl_writepages+0xec/0x180 [zfs]
[47851.367392]  do_writepages+0x43/0xd0
[47851.367396]  __filemap_fdatawrite_range+0xc4/0x100
[47851.367400]  filemap_write_and_wait_range+0x36/0x90
[47851.367478]  zpl_fsync+0x3d/0xa0 [zfs]
[47851.367483]  vfs_fsync_range+0x49/0x80
[47851.367486]  ? __fget_light+0x57/0x70
[47851.367489]  do_fsync+0x3d/0x70
[47851.367491]  __x64_sys_fsync+0x14/0x20
[47851.367495]  do_syscall_64+0x5a/0x130
[47851.367498]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[47851.367501] RIP: 0033:0x7f872e0af237
[47851.367508] Code: Bad RIP value.
[47851.367510] RSP: 002b:00007f8729766cb0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[47851.367512] RAX: ffffffffffffffda RBX: 0000000000000014 RCX: 00007f872e0af237
[47851.367514] RDX: 0000000000000000 RSI: 00007f872dcb762c RDI: 0000000000000014
[47851.367515] RBP: 00007f872dcb8da0 R08: 00007f8729767700 R09: 00007f8729767700
[47851.367516] R10: 000000000000062c R11: 0000000000000293 R12: 0000000000000002
[47851.367518] R13: 0000000000000000 R14: 00005647819af200 R15: 00007ffce0398f10
[47851.367618] INFO: task pip:19670 blocked for more than 483 seconds.
[47851.367660]       Tainted: P           O      5.3.0-13-generic #14-Ubuntu
[47851.367704] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[47851.367753] pip             D    0 19670  19606 0x00000120
[47851.367756] Call Trace:
[47851.367760]  __schedule+0x2b9/0x6c0
[47851.367764]  ? bit_wait_io+0x50/0x50
[47851.367766]  schedule+0x42/0xb0
[47851.367769]  bit_wait+0x11/0x50
[47851.367771]  __wait_on_bit+0x33/0xa0
[47851.367775]  __inode_wait_for_writeback+0xae/0xe0
[47851.367779]  ? var_wake_function+0x30/0x30
[47851.367782]  inode_wait_for_writeback+0x27/0x40
[47851.367785]  evict+0xbd/0x1b0
[47851.367787]  iput+0x141/0x1f0
[47851.367790]  do_unlinkat+0x1c5/0x2d0
[47851.367793]  __x64_sys_unlinkat+0x38/0x60
[47851.367795]  do_syscall_64+0x5a/0x130
[47851.367799]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[47851.367801] RIP: 0033:0x7f7a4ac39d77
[47851.367804] Code: Bad RIP value.
[47851.367806] RSP: 002b:00007fff4c816858 EFLAGS: 00000246 ORIG_RAX: 0000000000000107
[47851.367808] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f7a4ac39d77
[47851.367809] RDX: 0000000000000000 RSI: 00007f7a44c4bda0 RDI: 0000000000000005
[47851.367811] RBP: 0000555a9b59b240 R08: 0000000000000000 R09: 00007fff4c816510
[47851.367812] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f7a4aad01b0
[47851.367814] R13: 00007f7a4aac8e08 R14: 0000555a9a878240 R15: 0000555a9b9eb460
[47851.367817] INFO: task sync:20009 blocked for more than 483 seconds.
[47851.367859]       Tainted: P           O      5.3.0-13-generic #14-Ubuntu
[47851.367902] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[47851.367952] sync            D    0 20009  20008 0x00004120
[47851.367954] Call Trace:
[47851.367958]  __schedule+0x2b9/0x6c0
[47851.367961]  schedule+0x42/0xb0
[47851.367963]  wb_wait_for_completion+0x64/0x90
[47851.367966]  ? wait_woken+0x80/0x80
[47851.367968]  sync_inodes_sb+0xd7/0x290
[47851.367972]  sync_inodes_one_sb+0x15/0x20
[47851.367975]  iterate_supers+0xa3/0x110
[47851.367977]  ? page_cache_pipe_buf_steal.cold+0x1c/0x1c
[47851.367980]  ksys_sync+0x42/0xb0
[47851.367983]  __ia32_sys_sync+0xe/0x20
[47851.367985]  do_syscall_64+0x5a/0x130
[47851.367989]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[47851.367991] RIP: 0033:0x7f56f6bd7287
[47851.367994] Code: Bad RIP value.
[47851.367995] RSP: 002b:00007fff2c486078 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
[47851.367998] RAX: ffffffffffffffda RBX: 00007fff2c4861a8 RCX: 00007f56f6bd7287
[47851.367999] RDX: 00007f56f6eae001 RSI: 0000000000000000 RDI: 00007f56f6c754b0
[47851.368000] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[47851.368002] R10: 000055d597318010 R11: 0000000000000246 R12: 0000000000000001
[47851.368003] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

I can grab any other debugging information that would be helpful; just let me know what commands to run to grab it.
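In the meantime, two things that are often asked for with hangs like this (just a guess on my part; let me know if something else is more useful) are the kernel's blocked-task dump and the ZFS internal debug log:

# Dump the stacks of all blocked (D-state) tasks into dmesg (requires sysrq to be enabled).
echo w > /proc/sysrq-trigger
dmesg | tail -n 300

# ZFS-on-Linux internal debug messages; may be mostly empty unless the
# zfs_dbgmsg_enable module parameter is set.
cat /proc/spl/kstat/zfs/dbgmsg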

EDIT: Oh, and I suspect it may have SOMETHING to do with one particular LXC container. The container is stored in a ZFS dataset. Inside the container, the only thing running is Nextcloud, installed from Snapcraft.

ReimuHakurei avatar Oct 08 '19 22:10 ReimuHakurei

Check the SMART info for all disks in case one of them is about to fail, and watch out for SCSI/disk errors in dmesg; maybe this is an error occurring after some disk hiccup. Also make sure there is enough free space on both pools.

devZer0 avatar Oct 09 '19 17:10 devZer0

Both pools are nowhere near full. SMART data on all disks is fine. If the problem was caused by the disks, I would expect to see high iowait. However, there is no iowait.
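(For reference, those checks amount to roughly the following, assuming smartmontools and sysstat are installed; device names are just examples:)

smartctl -a /dev/sda   # SMART health and attributes for one disk
iostat -x 5 3          # per-device utilisation and %iowait, sampled every 5 seconds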

ReimuHakurei avatar Oct 09 '19 17:10 ReimuHakurei

I would suggest trying v0.8.2, since a kworker/rollback deadlock was fixed in that release. Your posted stacks don't confirm it's the same issue, but from the description it sounds like it could be. See #9203 and related issues for details on the deadlock.

alek-p avatar Oct 09 '19 17:10 alek-p

Alright, I'll give that a try. I see #7038 was referenced there, and it looks to me like this may be the same issue as that.

ReimuHakurei avatar Oct 09 '19 17:10 ReimuHakurei

I've updated the system to 0.8.2. I think I've pinpointed the thing that's triggering the problem (snapfuse in an LXC container spinning, doing lots of I/O, and sucking up an entire core), since that seems to have popped up around the same time, but I'll forgo fixing it for a few days to see whether 0.8.2 fixes the problem.

ReimuHakurei avatar Oct 10 '19 08:10 ReimuHakurei

Alright. Prior to updating to 0.8.2, it took 12-24 hours for this issue to crop up, and it's been 2 days with no sign of it, so I'm inclined to believe that this is now fixed.

Thanks!

ReimuHakurei avatar Oct 12 '19 02:10 ReimuHakurei

I've run into this with 0.8.2.

Ninpo avatar Oct 13 '19 19:10 Ninpo

Actually, yep, looks like last night it occurred again. Based on my sample size of 1, 0.8.2 seems to have improved it, since it took 4 days to occur rather than less than one day, but it has occurred again.

The system is currently still responsive. Before it completely locks up and forces me to reboot, and before I just disable the VM that I am fairly sure is causing the problem, is there any debugging info I can gather to assist in determining the root cause of the problem?

ReimuHakurei avatar Oct 13 '19 21:10 ReimuHakurei

Perhaps attach zpool status and zfs get all output?

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:01:18 with 0 errors on Fri Oct 11 01:09:10 2019
config:

        NAME           STATE     READ WRITE CKSUM
        rpool          ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            nvme0n1p4  ONLINE       0     0     0
            nvme1n1p4  ONLINE       0     0     0

errors: No known data errors

lz4 compression is enabled, defaults otherwise.

Ninpo avatar Oct 13 '19 23:10 Ninpo

This has just happened again:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                 
120819 root      20   0       0      0      0 R 100.0   0.0   9:46.79 kworker/u256:2+flush-zfs-1

While attempting to compile libcap.

[188621.756418] INFO: task emerge:9987 blocked for more than 614 seconds.
[188621.756419]       Tainted: P           OE     5.2.16-gentoo #4
[188621.756419] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[188621.756420] emerge          D    0  9987      1 0x00004004
[188621.756421] Call Trace:
[188621.756423]  ? __schedule+0x250/0x660
[188621.756426]  ? trace_hardirqs_on+0x31/0xd0
[188621.756427]  schedule+0x39/0xa0
[188621.756428]  wb_wait_for_completion+0x5e/0x90
[188621.756430]  ? wait_woken+0x70/0x70
[188621.756431]  sync_inodes_sb+0xd8/0x280
[188621.756433]  sync_filesystem+0x5e/0x90
[188621.756434]  __se_sys_syncfs+0x38/0x70
[188621.756435]  do_syscall_64+0x5a/0x130
[188621.756436]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[188621.756437] RIP: 0033:0x7f6f9f5e1d97
[188621.756439] Code: Bad RIP value.
[188621.756439] RSP: 002b:00007ffd7c463e38 EFLAGS: 00000246 ORIG_RAX: 0000000000000132
[188621.756440] RAX: ffffffffffffffda RBX: 00007ffd7c463e40 RCX: 00007f6f9f5e1d97
[188621.756441] RDX: 00007f6f9f4b0b80 RSI: 0000000073ffa500 RDI: 000000000000000d
[188621.756442] RBP: 00007ffd7c463e40 R08: 00007f6f9eb73cc8 R09: 000055643a62ef68
[188621.756442] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd7c463f88
[188621.756443] R13: 0000000000000000 R14: 00007ffd7c463ef0 R15: 00007ffd7c463f68

I'm tempted to go back to using git master. When I was on zfs-9999 just prior to the 0.8.2 release I didn't see this (I was using a git build because 0.8.x didn't build against the 5.2.x kernel at the time). I also noted that when I switched to stable 0.8.2, I had unsupported "future" features on my pool, so I rebuilt it to get a clean 0.8.2 pool.

Ninpo avatar Oct 14 '19 18:10 Ninpo

Might it be worth comparing some system configurations?

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:02:31 with 0 errors on Mon Oct  7 22:35:32 2019
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          sda1      ONLINE       0     0     0

errors: No known data errors

  pool: storage
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:55:30 with 0 errors on Mon Oct  7 23:35:51 2019
config:

        NAME                        STATE     READ WRITE CKSUM
        storage                     ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            scsi-35000cca24300d8f4  ONLINE       0     0     0
            scsi-35000cca243014464  ONLINE       0     0     0
          mirror-1                  ONLINE       0     0     0
            scsi-35000cca243015924  ONLINE       0     0     0
            scsi-35000cca24400a9a4  ONLINE       0     0     0

errors: No known data errors

The disks are connected to a generic LSI SAS2 HBA. The rpool is a single Crucial MX500 SSD connected to SAS port 0 on the HBA. The entire storage pool is encrypted. The disks in the storage pool are Hitachi Ultrastar 7K4000 SAS HDDs. They are connected to ports 4-7.

The system is a Dell PowerEdge R415, with two AMD Opteron 4365 EE processors and 80 GB of RAM (4x4GB + 4x16GB).

I've narrowed down what I believe to be the cause: a broken snap package inside of an LXC container (the container is on the storage pool; snapfuse is sucking up an entire core and reading from disk at a constant roughly 6 MB/sec). I have avoided fixing it, both to verify whether the update fixed the issue (it did not) and to assist in diagnosing the root cause.

The next time this system completely locks up and I have to hard reboot it, I'm fixing the problem that I believe is triggering it. So if there are any commands I can run or logs I can fetch that would assist the ZFS developers in determining the root cause, I need to know what those are sooner rather than later.

ReimuHakurei avatar Oct 15 '19 04:10 ReimuHakurei

What brought you to the snap package? I ask as I have one on my entire system that, as it happens, is active when the issue occurs. At this point I'm sure it's coincidence, buuut...

$ zpool status
  pool: misc
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:01:06 with 0 errors on Fri Oct 11 01:09:02 2019
config:

        NAME                                                                 STATE     READ WRITE CKSUM
        misc                                                                 ONLINE       0     0     0
          raidz1-0                                                           ONLINE       0     0     0
            ata-WDC_WD1001FALS-00J7B1_WD-WMATV2740378                        ONLINE       0     0     0
            ata-WDC_WD1001FALS-00J7B1_WD-WMATV2432116                        ONLINE       0     0     0
            ata-WDC_WD1001FALS-00J7B1_WD-WMATV2432663                        ONLINE       0     0     0
        logs
          mirror-1                                                           ONLINE       0     0     0
            ata-LITEONIT_LCS-256M6S_2.5_7mm_256GB-part1TW0XFJWX5508539D1600  ONLINE       0     0     0
            ata-KINGSTON_SH103S3240G_50026B72390010E4-part1                  ONLINE       0     0     0
        cache
          ata-LITEONIT_LCS-256M6S_2.5_7mm_256GB-part2TW0XFJWX5508539D1600    ONLINE       0     0     0
          ata-KINGSTON_SH103S3240G_50026B72390010E4-part2                    ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:01:18 with 0 errors on Fri Oct 11 01:09:10 2019
config:

        NAME           STATE     READ WRITE CKSUM
        rpool          ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            nvme1n1p4  ONLINE       0     0     0
            nvme0n1p4  ONLINE       0     0     0

errors: No known data errors

System is a Threadripper 1950X, 64GB RAM, ASRock mobo. Samsung NVMe disks in rpool; ten-year-old WD Blacks and a couple of SSDs in misc. Kernel 5.2.16.

It's rpool that's active and hangs when the issue occurs.

Ninpo avatar Oct 15 '19 08:10 Ninpo

snapfuse associated with that snap package (Nextcloud) has been sucking up an entire core and doing the constant I/O since an auto-update of the snap package in question occurred. The date of this auto update lines up very closely to the date the issue started appearing.

ReimuHakurei avatar Oct 15 '19 08:10 ReimuHakurei

Could you describe what snapfuse is/does? There doesn't seem to be much information about it on the net... I can't find a manpage or a git repo...

devZer0 avatar Oct 15 '19 10:10 devZer0

snap packages are distributed as squashfs images. Normally, those are just mounted using the native kernel support as loopback devices, but in my case, I have snap running inside of an LXC container. You can't use loop devices inside of LXC. As a workaround for this, snap comes with snapfuse, which is an implementation of squashfs in userspace using FUSE, which is supported inside of LXC containers.
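A quick way to see the difference from inside the container (just a sketch; the exact mount entries will vary) is to look at the mount table:

# With snapfuse, the snap images show up with a FUSE filesystem type instead
# of plain squashfs on a kernel loop device.
grep -E 'squashfs|fuse' /proc/mounts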

ReimuHakurei avatar Oct 15 '19 15:10 ReimuHakurei

Hmm, fuse you say...

franz-5.3.3-x86_64.AppImage on /tmp/.mount_franz-gpQL6L type fuse.franz-5.3.3-x86_64.AppImage (ro,nosuid,nodev,relatime,user_id=1000,group_id=1000)

Ninpo avatar Oct 15 '19 15:10 Ninpo

It looks to me like the only things actually hanging here are sync commands. Is it the same for you? I think this is the cause of my load averages spiking like they do.

# ps axu | grep " D " 
165536     302  0.0  0.0   6172   732 ?        D    02:00   0:00 sync
165536     307  0.0  0.0   6172   736 ?        D    10:15   0:00 sync
165536     647  0.0  0.0   6172   736 ?        D    Oct14   0:00 sync
165536     745  0.0  0.0   6172   628 ?        D    00:20   0:00 sync
165536     760  0.0  0.0   6172   736 ?        D    Oct14   0:00 sync
[snip]
165536   32052  0.0  0.0   6172   736 ?        D    Oct14   0:00 sync
165536   32084  0.0  0.0   6172   624 ?        D    Oct13   0:00 sync
165536   32104  0.0  0.0   6172   732 ?        D    Oct13   0:00 sync
165536   32273  0.0  0.0   6172   732 ?        D    Oct14   0:00 sync
165536   32307  0.0  0.0   6172   624 ?        D    03:35   0:00 sync
165536   32320  0.0  0.0   6172   728 ?        D    Oct13   0:00 sync
165536   32337  0.0  0.0   6172   628 ?        D    Oct14   0:00 sync
165536   32364  0.0  0.0   6172   628 ?        D    Oct13   0:00 sync
165536   32365  0.0  0.0   6172   732 ?        D    05:20   0:00 sync
165536   32435  0.0  0.0   6172   732 ?        D    Oct14   0:00 sync
165536   32474  0.0  0.0   6172   628 ?        D    Oct14   0:00 sync
165536   32477  0.0  0.0   6172   628 ?        D    Oct13   0:00 sync
165536   32603  0.0  0.0   6172   628 ?        D    Oct14   0:00 sync
165536   32647  0.0  0.0   6172   624 ?        D    08:30   0:00 sync
165536   32710  0.0  0.0   6172   628 ?        D    05:10   0:00 sync

ReimuHakurei avatar Oct 15 '19 17:10 ReimuHakurei

I'm going to replace sync with a copy of true on the container in question to see if this fixes my load average issues and makes this more manageable.
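(Roughly, the workaround I have in mind is the following, run inside the container; it assumes sync lives at /bin/sync:)

# Keep the real binary around and shadow it with a no-op.
mv /bin/sync /bin/sync.orig
cp /bin/true /bin/sync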

ReimuHakurei avatar Oct 15 '19 17:10 ReimuHakurei

I'm unsure; I only saw emerge and gzip "hang" last time, but I forgot to grab a full process list before rebooting (I needed the system back ASAP). When it happens again, I'll check.

Does it matter re: sync if it's just replaced with true or something when it comes to ZFS?

Ninpo avatar Oct 15 '19 18:10 Ninpo

I'm not sure. It's certainly not optimal but I would assume it can't really get any worse than the process just hanging in uninterruptible sleep.

ReimuHakurei avatar Oct 15 '19 18:10 ReimuHakurei

Hmm, fuse you say...

franz-5.3.3-x86_64.AppImage on /tmp/.mount_franz-gpQL6L type fuse.franz-5.3.3-x86_64.AppImage (ro,nosuid,nodev,relatime,user_id=1000,group_id=1000)

Mind that Franz has severe memory leaks.

devZer0 avatar Oct 16 '19 08:10 devZer0

The system load averages will climb constantly

I think this is normal: with every hanging sync there is one more process waiting for I/O, and on Linux every process stuck waiting like that (in uninterruptible sleep) adds to the load average just like a runnable one; it does not need to burn CPU. You can easily demonstrate this with stale NFS mounts.
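A quick way to see this on a live box (just a sketch):

# Tasks in uninterruptible sleep (state D) count toward the Linux load
# average even though they burn no CPU.
ps -eo state,pid,comm | awk '$1 == "D"' | wc -l
cat /proc/loadavg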

devZer0 avatar Oct 16 '19 09:10 devZer0

Instead of rebooting, please try stopping all the LXC stuff and unmounting all FUSE mounts; if a normal unmount doesn't work, try -f (force).

Then have a look at whether the sync processes go away, or try to kill them.

Maybe the sync is hanging because of FUSE...
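Roughly, that amounts to something like this (the container name and mount paths are placeholders):

lxc-stop -n <container>        # or: lxc stop <container> when using LXD
grep fuse /proc/mounts         # list the FUSE mounts that are still around
fusermount -u /path/to/mount   # normal FUSE unmount
umount -f /path/to/mount       # force it if the normal unmount hangs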

devZer0 avatar Oct 16 '19 10:10 devZer0

When sync hangs in D state inside the container, you cannot stop the LXC container. Any attempt to do so will just freeze.

Anyway, I will say that 0.8.2 has certainly made this problem at least slightly better. On 0.8.1 it was consistently occurring on a daily basis; on 0.8.2 it went about 4 days the first time, and it's now been almost 5 days since then and it's still working fine. I'll update the issue when it next occurs.

ReimuHakurei avatar Oct 22 '19 22:10 ReimuHakurei

maybe related ?

https://superuser.com/questions/1439108/lxc-start-stop-hangs-and-filesystem-sync-hangs

devZer0 avatar Oct 22 '19 22:10 devZer0

Nope.

By definition, when a task is in D state (uninterruptible sleep), absolutely nothing (short of a hard reboot) will end it. Because restarting an LXC container requires killing all processes, and you cannot kill a process in D state, you cannot restart an LXC container with any child processes in D state.
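About the most you can do with a D-state task is look at the kernel stack it is stuck in (needs root, and the kernel has to expose it; the PID below is a placeholder):

# Shows the in-kernel stack the blocked task is sitting in.
cat /proc/<PID>/stack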

ReimuHakurei avatar Oct 22 '19 22:10 ReimuHakurei

This has hit me again.

Anyone experiencing this on kernels <5.0.0?

Ninpo avatar Jan 10 '20 14:01 Ninpo

I'm experiencing this on Linux storagelols 5.0.0-16-generic #17~18.04.1-Ubuntu SMP, but I have no real way to debug it.

ukd1 avatar Feb 01 '20 19:02 ukd1

We were hit by the problem on 4.19. It looks like the problem accumulates deferred threads after some time; some threads can still continue.

The relevant thread shows the following after a sysrq dump:

[6756902.750815] CPU: 12 PID: 37541 Comm: kworker/u49:4 Tainted: P W O 4.19.86-gentoo #1
[6756902.750816] Hardware name: Supermicro X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
[6756902.750817] Workqueue: writeback wb_workfn (flush-zfs-30)
[6756902.750818] RIP: 0010:dbuf_verify+0x158/0x990 [zfs]
[6756902.750819] Code: 4d 85 e4 74 15 49 3b 5c 24 20 0f 85 86 01 00 00 4d 8b 64 24 28 4d 85 e4 75 eb 80 7b 50 00 75 0a 48 83 3b 00 0f 84 36 02 00 00 <4c> 8b 43 48 4d 85 c0 74 6f 48 8b 43 30 49 3b 45 68 0f 84 0e 03 00
[6756902.750819] RSP: 0018:ffffb11b55b57728 EFLAGS: 00000246
[6756902.750820] RAX: 0000000000000000 RBX: ffff9f3d298d93f8 RCX: 0000000000000000
[6756902.750821] RDX: 0000000000000001 RSI: ffffffffc0c688f8 RDI: ffff9f3aa7ea6448
[6756902.750821] RBP: ffffb11b55b57750 R08: 0000000001bec000 R09: 0000000001bec000
[6756902.750822] R10: ffff9f09229a86c0 R11: 00000000000011ce R12: 0000000000000000
[6756902.750822] R13: ffff9f3aa7ea5fa0 R14: 00000000000006fb R15: ffff9f3aa7ea5fa0
[6756902.750823] FS: 0000000000000000(0000) GS:ffff9f217f980000(0000) knlGS:0000000000000000
[6756902.750823] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[6756902.750824] CR2: 00007fc31495601e CR3: 0000003e7240c005 CR4: 00000000001626e0
[6756902.750824] Call Trace:
[6756902.750824]  dbuf_hold_impl_arg+0x12e/0xb70 [zfs]
[6756902.750825]  ? spl_kmem_alloc+0xf0/0x170 [spl]
[6756902.750825]  dbuf_hold_impl+0x95/0xc0 [zfs]
[6756902.750826]  dbuf_hold+0x33/0x60 [zfs]
[6756902.750826]  dnode_hold_impl+0x17a/0x10c0 [zfs]
[6756902.750827]  ? dbuf_rele_and_unlock+0x336/0x800 [zfs]
[6756902.750827]  ? zfs_refcount_remove_many+0x215/0x290 [zfs]
[6756902.750827]  ? zfs_refcount_destroy_many+0x159/0x200 [zfs]
[6756902.750828]  dnode_hold+0x1b/0x20 [zfs]
[6756902.750828]  dmu_bonus_hold+0x36/0x90 [zfs]
[6756902.750829]  sa_buf_hold+0xe/0x10 [zfs]
[6756902.750829]  zfs_zget+0x10d/0x310 [zfs]
[6756902.750829]  zfs_get_data+0x7e/0x520 [zfs]
[6756902.750830]  zil_commit_impl+0x1235/0x1c90 [zfs]
[6756902.750830]  zil_commit+0x59/0x160 [zfs]
[6756902.750830]  zpl_writepages+0xec/0x180 [zfs]
[6756902.750831]  do_writepages+0x41/0xd0
[6756902.750831]  ? tracing_record_taskinfo_skip+0x3f/0x50
[6756902.750832]  ? tracing_record_taskinfo+0x13/0xb0
[6756902.750833]  ? fprop_fraction_percpu+0x2f/0x80
[6756902.750834]  __writeback_single_inode+0x4d/0x390
[6756902.750834]  writeback_sb_inodes+0x23c/0x570
[6756902.750835]  wb_writeback+0xff/0x320
[6756902.750835]  wb_workfn+0xf3/0x450
[6756902.750836]  ? _raw_spin_unlock_irq+0x13/0x20
[6756902.750836]  ? finish_task_switch+0x83/0x290
[6756902.750836]  process_one_work+0x18a/0x3a0
[6756902.750837]  worker_thread+0x50/0x3b0
[6756902.750837]  kthread+0x106/0x140
[6756902.750837]  ? process_one_work+0x3a0/0x3a0
[6756902.750838]  ? kthread_park+0x80/0x80
[6756902.750838]  ret_from_fork+0x24/0x50

alexanderhaensch avatar May 06 '20 19:05 alexanderhaensch

[Screenshot: "Node - ZFS" Grafana dashboard, 2020-05-07]

Looks like the kworker thread is spinning and reading metadata while not committing to the ZIL. The ZIL commit count has been slightly lower since the event at 2:20 A.M. on 5/3. After a while, more and more programs hang behind this kworker thread, as one can see in the load graph.

The gap in the graphs is not related to this particular system; it was caused by a reconfiguration of the monitoring database.

alexanderhaensch avatar May 07 '20 07:05 alexanderhaensch