
Swap deadlock in 0.7.9

Open runderwo opened this issue 7 years ago • 144 comments

System information

Type Version/Name
Distribution Name Debian
Distribution Version stretch
Linux Kernel 4.16.0-0.bpo.2-amd64
Architecture amd64
ZFS Version 0.7.9-3~bpo9+1
SPL Version 0.7.9-3~bpo9+1

Describe the problem you're observing

System deadlocked while forcing a page out to a swap zvol. Unfortunately I do not have the rest of the backtraces.

Describe how to reproduce the problem

rsync a filesystem with O(1M) files.
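For illustration, a hedged example of the kind of rsync invocation involved (the paths are placeholders, not the ones from this report):

rsync -a /source/tree/with/many/files/ /rpool/data/backup/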

Include any warning/errors/backtraces from the system logs

(Photo of the console backtrace attached: imag0272)

NAME        PROPERTY               VALUE                  SOURCE
rpool/swap  type                   volume                 -
rpool/swap  creation               Tue Jun 26 17:52 2018  -
rpool/swap  used                   8.50G                  -
rpool/swap  available              8.84T                  -
rpool/swap  referenced             128K                   -
rpool/swap  compressratio          1.09x                  -
rpool/swap  reservation            none                   default
rpool/swap  volsize                8G                     local
rpool/swap  volblocksize           4K                     -
rpool/swap  checksum               on                     default
rpool/swap  compression            zle                    local
rpool/swap  readonly               off                    default
rpool/swap  createtxg              156102                 -
rpool/swap  copies                 1                      default
rpool/swap  refreservation         8.50G                  local
rpool/swap  guid                   11434631848701244988   -
rpool/swap  primarycache           metadata               local
rpool/swap  secondarycache         all                    default
rpool/swap  usedbysnapshots        0B                     -
rpool/swap  usedbydataset          128K                   -
rpool/swap  usedbychildren         0B                     -
rpool/swap  usedbyrefreservation   8.50G                  -
rpool/swap  logbias                throughput             local
rpool/swap  dedup                  off                    default
rpool/swap  mlslabel               none                   default
rpool/swap  sync                   always                 local
rpool/swap  refcompressratio       1.09x                  -
rpool/swap  written                128K                   -
rpool/swap  logicalused            46K                    -
rpool/swap  logicalreferenced      46K                    -
rpool/swap  volmode                default                default
rpool/swap  snapshot_limit         none                   default
rpool/swap  snapshot_count         none                   default
rpool/swap  snapdev                hidden                 default
rpool/swap  context                none                   default
rpool/swap  fscontext              none                   default
rpool/swap  defcontext             none                   default
rpool/swap  rootcontext            none                   default
rpool/swap  redundant_metadata     all                    default
rpool/swap  com.sun:auto-snapshot  false                  local

runderwo avatar Jul 21 '18 16:07 runderwo

@jgallag88 and I have been observing this as well. Here's our bug report with a simple reproducer:

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 18.04
Linux Kernel 4.15.0-33-generic
Architecture amd64
ZFS Version Master
SPL Version Master

Describe the problem you're observing

With zvol swap devices, high memory operations hang indefinitely despite there being a lot of swap space available. It appears that the hang occurs in ZIL code while ZFS is performing an allocation for a zio as part of its work to swap pages.

We also tested this scenario with an ext4-backed swap device and did not see any hangs.

This is the memory usage on one of our typical VMs:

              total        used        free      shared  buff/cache   available
Mem:           7.3G        3.6G        3.1G         32M        603M        3.4G
Swap:          4.0G          0B        4.0G

Reproducing the problem

Configure a zvol as a swap device

zfs create -V 4G -b "$(getconf PAGESIZE)" -o logbias=throughput -o sync=always  -o primarycache=metadata rpool/swap
mkswap -f /dev/zvol/rpool/swap
swapon /dev/zvol/rpool/swap

Run a high memory operation

Run something that will use up all your memory, for example stress: stress --vm 10 --vm-keep --vm-bytes 512M. Check that all memory is being used with watch free -h. Eventually the system will start to swap, but it will quickly hang and not recover.
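For convenience, the memory-pressure step as a small sketch (assumes the stress utility is installed; the numbers mirror the example above):

stress --vm 10 --vm-keep --vm-bytes 512M &
watch -n 1 free -h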

Examples

Example of the thread trying to swap out a page:

PID: 1      TASK: ffff8cadf6208000  CPU: 1   COMMAND: "systemd"
 #0 [ffffb10840c671f0] __schedule at ffffffffb0594571
 #1 [ffffb10840c67288] schedule at ffffffffb0594bac
 #2 [ffffb10840c67298] schedule_preempt_disabled at ffffffffb0594e8e
 #3 [ffffb10840c672a8] __mutex_lock at ffffffffb059654c
 #4 [ffffb10840c67338] __mutex_lock_slowpath at ffffffffb05968a3
 #5 [ffffb10840c67350] mutex_lock at ffffffffb05968df
 #6 [ffffb10840c67368] zil_commit_writer at ffffffffc0723d21 [zfs]
 #7 [ffffb10840c673a0] zil_commit_impl at ffffffffc0723e66 [zfs]
 #8 [ffffb10840c673c0] zil_commit at ffffffffc0723ef0 [zfs]
 #9 [ffffb10840c673e8] zvol_write at ffffffffc0746e20 [zfs]
#10 [ffffb10840c67488] zvol_request at ffffffffc0747fba [zfs]
#11 [ffffb10840c674d8] generic_make_request at ffffffffb00534e4
#12 [ffffb10840c67538] submit_bio at ffffffffb0053733
#13 [ffffb10840c67588] __swap_writepage at ffffffffafe26e83
#14 [ffffb10840c67620] swap_writepage at ffffffffafe26f44
#15 [ffffb10840c67648] pageout at ffffffffafde683b
#16 [ffffb10840c676d8] shrink_page_list at ffffffffafde9caa
#17 [ffffb10840c677a0] shrink_inactive_list at ffffffffafdea6f2
#18 [ffffb10840c67858] shrink_node_memcg at ffffffffafdeb1a9
#19 [ffffb10840c67958] shrink_node at ffffffffafdeb6b7
#20 [ffffb10840c679e8] do_try_to_free_pages at ffffffffafdeb989
#21 [ffffb10840c67a40] try_to_free_pages at ffffffffafdebcde
#22 [ffffb10840c67ac8] __alloc_pages_slowpath at ffffffffafdd958e
#23 [ffffb10840c67bd8] __alloc_pages_nodemask at ffffffffafdda203
#24 [ffffb10840c67c40] alloc_pages_current at ffffffffafe385da
#25 [ffffb10840c67c70] __page_cache_alloc at ffffffffafdcc131
#26 [ffffb10840c67c90] filemap_fault at ffffffffafdcf978
#27 [ffffb10840c67d48] __do_fault at ffffffffafe0a864
#28 [ffffb10840c67d70] handle_pte_fault at ffffffffafe0f9e3
#29 [ffffb10840c67dc8] __handle_mm_fault at ffffffffafe10598
#30 [ffffb10840c67e70] handle_mm_fault at ffffffffafe10791
#31 [ffffb10840c67ea8] __do_page_fault at ffffffffafc74990
#32 [ffffb10840c67f20] do_page_fault at ffffffffafc74c3e
#33 [ffffb10840c67f50] page_fault at ffffffffb06015e5
    RIP: 00007fcfe704d69a  RSP: 00007ffe9178ead0  RFLAGS: 00010202
    RAX: 0000000000000001  RBX: 000055ca1d2bd8c0  RCX: 00007fcfe753abb7
    RDX: 0000000000000056  RSI: 00007ffe9178ead0  RDI: 0000000000000000
    RBP: 00007ffe9178efe0   R8: 0000000000000000   R9: 61742e6369736162
    R10: 00000000ffffffff  R11: 0000000000000000  R12: 0000000000000001
    R13: ffffffffffffffff  R14: 00007ffe9178ead0  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b

Two places where zios were stuck

PID: 570    TASK: ffff91c660c32d80  CPU: 0   COMMAND: "z_wr_int"
 #0 [fffffe0000008cc0] machine_kexec at ffffffff896631c3
 #1 [fffffe0000008d20] __crash_kexec at ffffffff8972a479
 #2 [fffffe0000008de8] panic at ffffffff8968c7b0
 #3 [fffffe0000008e70] nmi_panic at ffffffff8968c329
 #4 [fffffe0000008e80] unknown_nmi_error at ffffffff896318e7
 #5 [fffffe0000008ea0] default_do_nmi at ffffffff89631a8e
 #6 [fffffe0000008ec8] do_nmi at ffffffff89631bc9
 #7 [fffffe0000008ef0] end_repeat_nmi at ffffffff8a0019eb
    [exception RIP: putback_inactive_pages+501]
    RIP: ffffffff897ea245  RSP: ffff9fc501e9f658  RFLAGS: 00000003
    RAX: ffff91c677012800  RBX: ffffeb4ac7e52a80  RCX: 0000000000000001
    RDX: ffffeb4ac7380720  RSI: 0000000000000000  RDI: ffff91c677012800
    RBP: ffff9fc501e9f6d8   R8: 0000000000027f20   R9: 0000000000036164
    R10: ffff91c67ffd2000  R11: 00000000000000e5  R12: 0000000000000000
    R13: ffffeb4ac7e52aa0  R14: ffff91c677012800  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <NMI exception stack> ---
 #8 [ffff9fc501e9f658] putback_inactive_pages at ffffffff897ea245
 #9 [ffff9fc501e9f6e0] shrink_inactive_list at ffffffff897ea74d
#10 [ffff9fc501e9f798] shrink_node_memcg at ffffffff897eb1a9
#11 [ffff9fc501e9f898] shrink_node at ffffffff897eb6b7
#12 [ffff9fc501e9f930] do_try_to_free_pages at ffffffff897eb989
#13 [ffff9fc501e9f988] try_to_free_pages at ffffffff897ebcde
#14 [ffff9fc501e9fa10] __alloc_pages_slowpath at ffffffff897d958e
#15 [ffff9fc501e9fb20] __alloc_pages_nodemask at ffffffff897da203
#16 [ffff9fc501e9fb88] __alloc_pages at ffffffffc07851db [zfs]
#17 [ffff9fc501e9fb98] __alloc_pages_node at ffffffffc07851f1 [zfs]
#18 [ffff9fc501e9fba8] alloc_pages_node at ffffffffc0785225 [zfs]
#19 [ffff9fc501e9fbc0] abd_alloc_chunk at ffffffffc078523e [zfs]
#20 [ffff9fc501e9fbd0] abd_alloc_pages.constprop.8 at ffffffffc0785e85 [zfs]
#21 [ffff9fc501e9fc58] abd_alloc at ffffffffc07862f5 [zfs]
#22 [ffff9fc501e9fc80] abd_alloc_for_io at ffffffffc078647e [zfs]
#23 [ffff9fc501e9fc90] vdev_queue_aggregate at ffffffffc086187a [zfs]
#24 [ffff9fc501e9fd00] vdev_queue_io_to_issue at ffffffffc0861e11 [zfs]
#25 [ffff9fc501e9fd48] vdev_queue_io_done at ffffffffc0862582 [zfs]
#26 [ffff9fc501e9fd88] zio_vdev_io_done at ffffffffc08c61cf [zfs]
#27 [ffff9fc501e9fdb0] zio_execute at ffffffffc08c6c2c [zfs]
#28 [ffff9fc501e9fdf8] taskq_thread at ffffffffc065254d [spl]
#29 [ffff9fc501e9ff08] kthread at ffffffff896ae531
#30 [ffff9fc501e9ff50] ret_from_fork at ffffffff8a000205

And the next place:

PID: 567    TASK: ffff97afe0e05b00  CPU: 0   COMMAND: "z_wr_iss"
 #0 [fffffe0000008cc0] machine_kexec at ffffffffa64631c3
 #1 [fffffe0000008d20] __crash_kexec at ffffffffa652a479
 #2 [fffffe0000008de8] panic at ffffffffa648c7b0
 #3 [fffffe0000008e70] nmi_panic at ffffffffa648c329
 #4 [fffffe0000008e80] unknown_nmi_error at ffffffffa64318e7
 #5 [fffffe0000008ea0] default_do_nmi at ffffffffa6431a8e
 #6 [fffffe0000008ec8] do_nmi at ffffffffa6431bc9
 #7 [fffffe0000008ef0] end_repeat_nmi at ffffffffa6e019eb
    [exception RIP: page_referenced_one]
    RIP: ffffffffa661e5f0  RSP: ffffa65681e5f3a8  RFLAGS: 00000246
    RAX: ffffffffa661e5f0  RBX: 00000000ffc49000  RCX: ffffa65681e5f418
    RDX: 00000000ffc49000  RSI: ffff97afc231cea0  RDI: ffffdde407197dc0
    RBP: ffffa65681e5f3f8   R8: 0000000000080000   R9: 0000000000000000
    R10: 0000000000000001  R11: 0000000000000004  R12: ffffdde407197dc0
    R13: ffff97afc231cea0  R14: ffffa65681e5f430  R15: ffff97afc104dd80
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <NMI exception stack> ---
 #8 [ffffa65681e5f3a8] page_referenced_one at ffffffffa661e5f0
 #9 [ffffa65681e5f3a8] rmap_walk_anon at ffffffffa661e47a
#10 [ffffa65681e5f400] rmap_walk at ffffffffa66203c8
#11 [ffffa65681e5f410] page_referenced at ffffffffa66204bc
#12 [ffffa65681e5f488] shrink_page_list at ffffffffa65e9960
#13 [ffffa65681e5f550] shrink_inactive_list at ffffffffa65ea6f2
#14 [ffffa65681e5f608] shrink_node_memcg at ffffffffa65eb1a9
#15 [ffffa65681e5f708] shrink_node at ffffffffa65eb6b7
#16 [ffffa65681e5f7a0] do_try_to_free_pages at ffffffffa65eb989
#17 [ffffa65681e5f7f8] try_to_free_pages at ffffffffa65ebcde
#18 [ffffa65681e5f880] __alloc_pages_slowpath at ffffffffa65d958e
#19 [ffffa65681e5f990] __alloc_pages_nodemask at ffffffffa65da203
#20 [ffffa65681e5f9f8] alloc_pages_current at ffffffffa66385da
#21 [ffffa65681e5fa28] new_slab at ffffffffa6645884
#22 [ffffa65681e5fa98] ___slab_alloc at ffffffffa66463b7
#23 [ffffa65681e5fb58] __slab_alloc at ffffffffa6646530
#24 [ffffa65681e5fb80] kmem_cache_alloc at ffffffffa66466cb
#25 [ffffa65681e5fbc0] spl_kmem_cache_alloc at ffffffffc06d6a3b [spl]
#26 [ffffa65681e5fbf8] zio_create at ffffffffc0960601 [zfs]
#27 [ffffa65681e5fc50] zio_vdev_child_io at ffffffffc0961fcc [zfs]
#28 [ffffa65681e5fcf8] vdev_mirror_io_start at ffffffffc08f6b4e [zfs]
#29 [ffffa65681e5fd70] zio_vdev_io_start at ffffffffc0966ac7 [zfs]
#30 [ffffa65681e5fdb0] zio_execute at ffffffffc095cc2c [zfs]
#31 [ffffa65681e5fdf8] taskq_thread at ffffffffc06db54d [spl]
#32 [ffffa65681e5ff08] kthread at ffffffffa64ae531
#33 [ffffa65681e5ff50] ret_from_fork at ffffffffa6e00205

shartse avatar Aug 28 '18 19:08 shartse

Can reproduce this as well on:

System information

Type Version/Name
Distribution Name Proxmox VE (based on Debian)
Distribution Version 5.2 / stretch
Linux Kernel 4.15.18-4-pve (based on ubuntu-bionic)
Architecture amd64
ZFS Version 0.7.9-pve3~bpo9
SPL Version 0.7.9-1

The swap zvol was created as above, with "-o compression=zle -o secondarycache=none" as additional parameters. The system is a small QEMU machine; using a QEMU disk without ZFS as swap does not lock up the machine.

I could reproduce the issue with plain Debian stretch: kernel 4.9.110-3+deb9u4 and zfs 0.7.9 (from stretch-backports). The problem does not occur on plain Debian stretch with kernel 4.9.110-3+deb9u4 and zfs 0.6.5.9.

Please let me know if/how I can help in hunting this down/fixing this.

Console output with hung tasks on 4.15.18-4 (ZFS 0.7.9):

[  846.823830] INFO: task systemd:1 blocked for more than 120 seconds.
[  846.824978]       Tainted: P           O     4.15.18-4-pve #1
[  846.825888] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  846.826774] systemd         D    0     1      0 0x00000000
[  846.827409] Call Trace:
[  846.827722]  __schedule+0x3e0/0x870
[  846.828145]  schedule+0x36/0x80
[  846.828541]  cv_wait_common+0x11e/0x140 [spl]
[  846.829088]  ? wait_woken+0x80/0x80
[  846.829477]  __cv_wait+0x15/0x20 [spl]
[  846.829968]  txg_wait_open+0xb0/0x100 [zfs]
[  846.830455]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  846.830992]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  846.831485]  zvol_write+0x175/0x620 [zfs]
[  846.831994]  zvol_request+0x24a/0x300 [zfs]
[  846.832513]  ? SyS_madvise+0xa20/0xa20
[  846.833124]  generic_make_request+0x123/0x2f0
[  846.833908]  submit_bio+0x73/0x140
[  846.834491]  ? submit_bio+0x73/0x140
[  846.835098]  ? get_swap_bio+0xcf/0x100
[  846.835780]  __swap_writepage+0x345/0x3b0
[  846.836463]  ? __frontswap_store+0x73/0x100
[  846.837190]  swap_writepage+0x34/0x90
[  846.838503]  pageout.isra.53+0x1e5/0x330
[  846.839874]  shrink_page_list+0x955/0xb70
[  846.841326]  shrink_inactive_list+0x256/0x5e0
[  846.842768]  ? next_arg+0x80/0x110
[  846.843851]  shrink_node_memcg+0x365/0x780
[  846.845052]  shrink_node+0xe1/0x310
[  846.846113]  ? shrink_node+0xe1/0x310
[  846.847153]  do_try_to_free_pages+0xef/0x360
[  846.848258]  try_to_free_pages+0xf2/0x1b0
[  846.849392]  __alloc_pages_slowpath+0x401/0xf10
[  846.850513]  ? __page_cache_alloc+0x86/0x90
[  846.851612]  __alloc_pages_nodemask+0x25b/0x280
[  846.852784]  alloc_pages_current+0x6a/0xe0
[  846.853903]  __page_cache_alloc+0x86/0x90
[  846.855011]  filemap_fault+0x369/0x740
[  846.856057]  ? page_add_file_rmap+0xf7/0x150
[  846.857482]  ? filemap_map_pages+0x369/0x380
[  846.858618]  ext4_filemap_fault+0x31/0x44
[  846.859708]  __do_fault+0x24/0xe3
[  846.860751]  __handle_mm_fault+0xcd7/0x11e0
[  846.861842]  ? ep_read_events_proc+0xd0/0xd0
[  846.863129]  handle_mm_fault+0xce/0x1b0
[  846.864175]  __do_page_fault+0x25e/0x500
[  846.865251]  ? wake_up_q+0x80/0x80
[  846.866175]  do_page_fault+0x2e/0xe0
[  846.867108]  ? async_page_fault+0x2f/0x50
[  846.868080]  do_async_page_fault+0x1a/0x80
[  846.869078]  async_page_fault+0x45/0x50
[  846.869985] RIP: 0033:0x7f551d2d30a3
[  846.870874] RSP: 002b:00007ffdbd747e28 EFLAGS: 00010246
[  846.872075] RAX: 0000000000000001 RBX: 0000558ebcbf3120 RCX: 00007f551d2d30a3
[  846.873473] RDX: 0000000000000041 RSI: 00007ffdbd747e30 RDI: 0000000000000004
[  846.874811] RBP: 00007ffdbd748240 R08: 431bde82d7b634db R09: 0000000000000100
[  846.876098] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007ffdbd747e30
[  846.877594] R13: 0000000000000001 R14: ffffffffffffffff R15: 0000000000000003
[  846.878882] INFO: task kswapd0:34 blocked for more than 120 seconds.
[  846.880099]       Tainted: P           O     4.15.18-4-pve #1
[  846.881379] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  846.882873] kswapd0         D    0    34      2 0x80000000
[  846.883989] Call Trace:
[  846.884834]  __schedule+0x3e0/0x870
[  846.885771]  schedule+0x36/0x80
[  846.886722]  cv_wait_common+0x11e/0x140 [spl]
[  846.887853]  ? wait_woken+0x80/0x80
[  846.888889]  __cv_wait+0x15/0x20 [spl]
[  846.889927]  txg_wait_open+0xb0/0x100 [zfs]
[  846.891027]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  846.892132]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  846.893185]  zvol_write+0x175/0x620 [zfs]
[  846.894136]  ? avl_add+0x74/0xa0 [zavl]
[  846.895112]  zvol_request+0x24a/0x300 [zfs]
[  846.896124]  ? SyS_madvise+0xa20/0xa20
[  846.897118]  generic_make_request+0x123/0x2f0
[  846.898189]  submit_bio+0x73/0x140
[  846.899171]  ? submit_bio+0x73/0x140
[  846.900115]  ? get_swap_bio+0xcf/0x100
[  846.901183]  __swap_writepage+0x345/0x3b0
[  846.902212]  ? __frontswap_store+0x73/0x100
[  846.903221]  swap_writepage+0x34/0x90
[  846.904185]  pageout.isra.53+0x1e5/0x330
[  846.905207]  shrink_page_list+0x955/0xb70
[  846.906238]  shrink_inactive_list+0x256/0x5e0
[  846.907257]  ? __wake_up+0x13/0x20
[  846.908211]  ? __cv_signal+0x2d/0x40 [spl]
[  846.909286]  ? next_arg+0x80/0x110
[  846.910237]  shrink_node_memcg+0x365/0x780
[  846.911275]  shrink_node+0xe1/0x310
[  846.912253]  ? shrink_node+0xe1/0x310
[  846.913273]  kswapd+0x386/0x770
[  846.914213]  kthread+0x105/0x140
[  846.915168]  ? mem_cgroup_shrink_node+0x180/0x180
[  846.916326]  ? kthread_create_worker_on_cpu+0x70/0x70
[  846.917519]  ret_from_fork+0x35/0x40
[  846.918631] INFO: task jbd2/sda1-8:196 blocked for more than 120 seconds.
[  846.920121]       Tainted: P           O     4.15.18-4-pve #1
[  846.921514] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  846.923138] jbd2/sda1-8     D    0   196      2 0x80000000
[  846.924465] Call Trace:
[  846.925372]  __schedule+0x3e0/0x870
[  846.926425]  schedule+0x36/0x80
[  846.927439]  cv_wait_common+0x11e/0x140 [spl]
[  846.928567]  ? wait_woken+0x80/0x80
[  846.929635]  __cv_wait+0x15/0x20 [spl]
[  846.930714]  txg_wait_open+0xb0/0x100 [zfs]
[  846.931826]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  846.932937]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  846.934063]  zvol_write+0x175/0x620 [zfs]
[  846.935090]  ? avl_find+0x5f/0xa0 [zavl]
[  846.936147]  zvol_request+0x24a/0x300 [zfs]
[  846.937292]  ? SyS_madvise+0xa20/0xa20
[  846.938274]  generic_make_request+0x123/0x2f0
[  846.939445]  submit_bio+0x73/0x140
[  846.940395]  ? submit_bio+0x73/0x140
[  846.941384]  ? get_swap_bio+0xcf/0x100
[  846.942427]  __swap_writepage+0x345/0x3b0
[  846.943393]  ? __frontswap_store+0x73/0x100
[  846.944395]  swap_writepage+0x34/0x90
[  846.945419]  pageout.isra.53+0x1e5/0x330
[  846.946424]  shrink_page_list+0x955/0xb70
[  846.947423]  shrink_inactive_list+0x256/0x5e0
[  846.948465]  ? next_arg+0x80/0x110
[  846.949443]  shrink_node_memcg+0x365/0x780
[  846.950518]  ? _cond_resched+0x1a/0x50
[  846.951533]  shrink_node+0xe1/0x310
[  846.952469]  ? shrink_node+0xe1/0x310
[  846.953446]  do_try_to_free_pages+0xef/0x360
[  846.954449]  try_to_free_pages+0xf2/0x1b0
[  846.955670]  __alloc_pages_slowpath+0x401/0xf10
[  846.956738]  ? ext4_map_blocks+0x436/0x5d0
[  846.957745]  ? __switch_to_asm+0x34/0x70
[  846.958753]  __alloc_pages_nodemask+0x25b/0x280
[  846.959807]  alloc_pages_current+0x6a/0xe0
[  846.960860]  __page_cache_alloc+0x86/0x90
[  846.961867]  pagecache_get_page+0xab/0x2b0
[  846.962968]  __getblk_gfp+0x109/0x300
[  846.964009]  jbd2_journal_get_descriptor_buffer+0x5e/0xe0
[  846.965220]  journal_submit_commit_record+0x84/0x200
[  846.966361]  ? wait_woken+0x80/0x80
[  846.967350]  jbd2_journal_commit_transaction+0x1299/0x1720
[  846.968574]  ? __switch_to_asm+0x34/0x70
[  846.969665]  ? __switch_to_asm+0x40/0x70
[  846.970672]  ? finish_task_switch+0x74/0x200
[  846.971787]  kjournald2+0xc8/0x260
[  846.972772]  ? kjournald2+0xc8/0x260
[  846.973770]  ? wait_woken+0x80/0x80
[  846.974696]  kthread+0x105/0x140
[  846.975568]  ? commit_timeout+0x20/0x20
[  846.976555]  ? kthread_create_worker_on_cpu+0x70/0x70
[  846.977751]  ? do_syscall_64+0x73/0x130
[  846.978746]  ? SyS_exit_group+0x14/0x20
[  846.979736]  ret_from_fork+0x35/0x40
[  846.980747] INFO: task systemd-journal:236 blocked for more than 120 seconds.
[  846.982049]       Tainted: P           O     4.15.18-4-pve #1
[  846.983182] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  846.984707] systemd-journal D    0   236      1 0x00000120
[  846.986238] Call Trace:
[  846.987172]  __schedule+0x3e0/0x870
[  846.988109]  ? spl_kmem_zalloc+0xa4/0x190 [spl]
[  846.989188]  schedule+0x36/0x80
[  846.990073]  cv_wait_common+0x11e/0x140 [spl]
[  846.991083]  ? wait_woken+0x80/0x80
[  846.992044]  __cv_wait+0x15/0x20 [spl]
[  846.993046]  txg_wait_open+0xb0/0x100 [zfs]
[  846.994018]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  846.994985]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  846.996196]  zvol_write+0x175/0x620 [zfs]
[  846.997469]  zvol_request+0x24a/0x300 [zfs]
[  846.998562]  ? SyS_madvise+0xa20/0xa20
[  846.999542]  generic_make_request+0x123/0x2f0
[  847.000637]  submit_bio+0x73/0x140
[  847.001711]  ? submit_bio+0x73/0x140
[  847.002674]  ? get_swap_bio+0xcf/0x100
[  847.003655]  __swap_writepage+0x345/0x3b0
[  847.004788]  ? __frontswap_store+0x73/0x100
[  847.005816]  swap_writepage+0x34/0x90
[  847.006733]  pageout.isra.53+0x1e5/0x330
[  847.007692]  shrink_page_list+0x955/0xb70
[  847.008845]  shrink_inactive_list+0x256/0x5e0
[  847.010100]  ? next_arg+0x80/0x110
[  847.011214]  shrink_node_memcg+0x365/0x780
[  847.012477]  shrink_node+0xe1/0x310
[  847.013530]  ? shrink_node+0xe1/0x310
[  847.014567]  do_try_to_free_pages+0xef/0x360
[  847.015686]  try_to_free_pages+0xf2/0x1b0
[  847.016894]  __alloc_pages_slowpath+0x401/0xf10
[  847.017958]  ? __page_cache_alloc+0x86/0x90
[  847.018954]  __alloc_pages_nodemask+0x25b/0x280
[  847.020080]  alloc_pages_current+0x6a/0xe0
[  847.021240]  __page_cache_alloc+0x86/0x90
[  847.022334]  filemap_fault+0x369/0x740
[  847.023299]  ? page_add_file_rmap+0xf7/0x150
[  847.024324]  ? filemap_map_pages+0x369/0x380
[  847.025381]  ext4_filemap_fault+0x31/0x44
[  847.026388]  __do_fault+0x24/0xe3
[  847.027337]  __handle_mm_fault+0xcd7/0x11e0
[  847.028382]  ? ep_read_events_proc+0xd0/0xd0
[  847.029472]  handle_mm_fault+0xce/0x1b0
[  847.030504]  __do_page_fault+0x25e/0x500
[  847.031518]  ? wake_up_q+0x80/0x80
[  847.032505]  do_page_fault+0x2e/0xe0
[  847.033544]  ? async_page_fault+0x2f/0x50
[  847.034608]  do_async_page_fault+0x1a/0x80
[  847.035740]  async_page_fault+0x45/0x50
[  847.036859] RIP: 0033:0x7f5e14a430c3
[  847.037888] RSP: 002b:00007ffde5fb7fd0 EFLAGS: 00010293
[  847.039163] RAX: 0000000000000001 RBX: 000056164a08f1e0 RCX: 00007f5e14a430c3
[  847.040736] RDX: 000000000000001c RSI: 00007ffde5fb7fe0 RDI: 0000000000000008
[  847.042378] RBP: 00007ffde5fb8230 R08: 431bde82d7b634db R09: 000000f966b2e318
[  847.043991] R10: 00000000ffffffff R11: 0000000000000293 R12: 00007ffde5fb7fe0
[  847.045640] R13: 0000000000000001 R14: ffffffffffffffff R15: 0005750c02d60440
[  847.047266] INFO: task systemd-udevd:255 blocked for more than 120 seconds.
[  847.048893]       Tainted: P           O     4.15.18-4-pve #1
[  847.050532] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  847.052367] systemd-udevd   D    0   255      1 0x00000120
[  847.053924] Call Trace:
[  847.054897]  __schedule+0x3e0/0x870
[  847.056031]  schedule+0x36/0x80
[  847.057129]  cv_wait_common+0x11e/0x140 [spl]
[  847.058391]  ? wait_woken+0x80/0x80
[  847.059493]  __cv_wait+0x15/0x20 [spl]
[  847.060683]  txg_wait_open+0xb0/0x100 [zfs]
[  847.061979]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  847.063203]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  847.064447]  ? zvol_setup_zv+0x1b0/0x1b0 [zfs]
[  847.065775]  dmu_sync_late_arrival+0x53/0x150 [zfs]
[  847.067099]  ? dmu_write_policy+0xca/0x350 [zfs]
[  847.068394]  dmu_sync+0x3b4/0x490 [zfs]
[  847.069756]  ? dbuf_hold+0x33/0x60 [zfs]
[  847.070998]  ? zvol_setup_zv+0x1b0/0x1b0 [zfs]
[  847.072266]  zvol_get_data+0x164/0x180 [zfs]
[  847.073532]  zil_commit.part.14+0x451/0x8b0 [zfs]
[  847.074855]  zil_commit+0x17/0x20 [zfs]
[  847.076040]  zvol_write+0x5a2/0x620 [zfs]
[  847.077285]  zvol_request+0x24a/0x300 [zfs]
[  847.078528]  ? SyS_madvise+0xa20/0xa20
[  847.079677]  generic_make_request+0x123/0x2f0
[  847.080963]  submit_bio+0x73/0x140
[  847.082058]  ? submit_bio+0x73/0x140
[  847.083093]  ? get_swap_bio+0xcf/0x100
[  847.084135]  __swap_writepage+0x345/0x3b0
[  847.085284]  ? __frontswap_store+0x73/0x100
[  847.086424]  swap_writepage+0x34/0x90
[  847.087471]  pageout.isra.53+0x1e5/0x330
[  847.088565]  shrink_page_list+0x955/0xb70
[  847.089688]  shrink_inactive_list+0x256/0x5e0
[  847.090715]  ? next_arg+0x80/0x110
[  847.091670]  shrink_node_memcg+0x365/0x780
[  847.092808]  shrink_node+0xe1/0x310
[  847.093789]  ? shrink_node+0xe1/0x310
[  847.094739]  do_try_to_free_pages+0xef/0x360
[  847.095766]  try_to_free_pages+0xf2/0x1b0
[  847.096834]  __alloc_pages_slowpath+0x401/0xf10
[  847.097863]  ? security_file_open+0x90/0xa0
[  847.098852]  ? terminate_walk+0x91/0xf0
[  847.099919]  __alloc_pages_nodemask+0x25b/0x280
[  847.101087]  alloc_pages_current+0x6a/0xe0
[  847.102101]  __get_free_pages+0xe/0x30
[  847.103044]  pgd_alloc+0x1e/0x170
[  847.103941]  mm_init+0x197/0x280
[  847.104935]  copy_process.part.35+0xa50/0x1ab0
[  847.106029]  ? __seccomp_filter+0x49/0x540
[  847.107024]  ? _raw_spin_unlock_bh+0x1e/0x20
[  847.108080]  _do_fork+0xdf/0x3f0
[  847.109080]  ? __secure_computing+0x3f/0x100
[  847.110242]  ? syscall_trace_enter+0xca/0x2e0
[  847.111348]  SyS_clone+0x19/0x20
[  847.112435]  do_syscall_64+0x73/0x130
[  847.113606]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  847.114968] RIP: 0033:0x7f7ae276a38b
[  847.116137] RSP: 002b:00007ffea1b4b950 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[  847.117728] RAX: ffffffffffffffda RBX: 00007ffea1b4b950 RCX: 00007f7ae276a38b
[  847.119429] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[  847.121162] RBP: 00007ffea1b4b9a0 R08: 00007f7ae39148c0 R09: 0000000000000210
[  847.122879] R10: 00007f7ae3914b90 R11: 0000000000000246 R12: 0000000000000000
[  847.124640] R13: 0000000000000020 R14: 0000000000000000 R15: 0000000000000000
[  847.126237] INFO: task systemd-timesyn:2623 blocked for more than 120 seconds.
[  847.127856]       Tainted: P           O     4.15.18-4-pve #1
[  847.129293] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  847.130993] systemd-timesyn D    0  2623      1 0x00000120
[  847.132390] Call Trace:
[  847.133404]  __schedule+0x3e0/0x870
[  847.134513]  ? spl_kmem_zalloc+0xa4/0x190 [spl]
[  847.135783]  schedule+0x36/0x80
[  847.136876]  cv_wait_common+0x11e/0x140 [spl]
[  847.138128]  ? wait_woken+0x80/0x80
[  847.139215]  __cv_wait+0x15/0x20 [spl]
[  847.140503]  txg_wait_open+0xb0/0x100 [zfs]
[  847.141783]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  847.142980]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  847.144255]  zvol_write+0x175/0x620 [zfs]
[  847.145531]  ? avl_find+0x5f/0xa0 [zavl]
[  847.146818]  zvol_request+0x24a/0x300 [zfs]
[  847.166365]  ? SyS_madvise+0xa20/0xa20
[  847.167706]  generic_make_request+0x123/0x2f0
[  847.169118]  submit_bio+0x73/0x140
[  847.170263]  ? submit_bio+0x73/0x140
[  847.171385]  ? get_swap_bio+0xcf/0x100
[  847.172468]  __swap_writepage+0x345/0x3b0
[  847.173624]  ? __frontswap_store+0x73/0x100
[  847.174690]  swap_writepage+0x34/0x90
[  847.175697]  pageout.isra.53+0x1e5/0x330
[  847.176840]  shrink_page_list+0x955/0xb70
[  847.177948]  shrink_inactive_list+0x256/0x5e0
[  847.179101]  ? next_arg+0x80/0x110
[  847.180086]  shrink_node_memcg+0x365/0x780
[  847.181168]  shrink_node+0xe1/0x310
[  847.182126]  ? shrink_node+0xe1/0x310
[  847.183114]  do_try_to_free_pages+0xef/0x360
[  847.184216]  try_to_free_pages+0xf2/0x1b0
[  847.185362]  __alloc_pages_slowpath+0x401/0xf10
[  847.186524]  ? dev_queue_xmit+0x10/0x20
[  847.187576]  ? br_dev_queue_push_xmit+0x7a/0x140
[  847.188691]  __alloc_pages_nodemask+0x25b/0x280
[  847.189891]  alloc_pages_vma+0x88/0x1c0
[  847.190975]  __read_swap_cache_async+0x147/0x200
[  847.192166]  read_swap_cache_async+0x2b/0x60
[  847.193337]  swapin_readahead+0x22f/0x2b0
[  847.194369]  ? radix_tree_lookup_slot+0x22/0x50
[  847.195408]  ? find_get_entry+0x1e/0x100
[  847.196396]  ? pagecache_get_page+0x2c/0x2b0
[  847.197482]  do_swap_page+0x52d/0x9b0
[  847.198517]  ? do_swap_page+0x52d/0x9b0
[  847.199527]  ? crypto_shash_update+0x47/0x130
[  847.200539]  __handle_mm_fault+0x88d/0x11e0
[  847.201564]  ? jbd2_journal_dirty_metadata+0x22d/0x290
[  847.202735]  handle_mm_fault+0xce/0x1b0
[  847.203761]  __do_page_fault+0x25e/0x500
[  847.204766]  ? __switch_to_asm+0x34/0x70
[  847.205728]  do_page_fault+0x2e/0xe0
[  847.206648]  do_async_page_fault+0x1a/0x80
[  847.207715]  async_page_fault+0x25/0x50
[  847.208725] RIP: 0010:ep_send_events_proc+0x120/0x1a0
[  847.209764] RSP: 0000:ffffad16810c3d80 EFLAGS: 00010246
[  847.210821] RAX: 0000000000000001 RBX: ffffad16810c3df8 RCX: 00007fffc09eaec0
[  847.212099] RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246
[  847.213504] RBP: ffffad16810c3dd8 R08: 00000000000002c6 R09: ffff9d25fb9f5f98
[  847.214822] R10: ffffad16810c3cc0 R11: 0000000000000038 R12: 0000000000000000
[  847.216193] R13: ffffad16810c3e78 R14: ffff9d25fb9b89c0 R15: ffff9d25fb9f5f98
[  847.217572]  ? ep_read_events_proc+0xd0/0xd0
[  847.218601]  ep_scan_ready_list.constprop.18+0x9e/0x210
[  847.219743]  ep_poll+0x1f8/0x3b0
[  847.220735]  ? wake_up_q+0x80/0x80
[  847.221659]  SyS_epoll_wait+0xce/0xf0
[  847.222683]  do_syscall_64+0x73/0x130
[  847.223681]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  847.224859] RIP: 0033:0x7fe0d40920c3
[  847.225852] RSP: 002b:00007fffc09eaeb0 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8
[  847.227252] RAX: ffffffffffffffda RBX: 000056492a0138f0 RCX: 00007fe0d40920c3
[  847.228575] RDX: 0000000000000006 RSI: 00007fffc09eaec0 RDI: 0000000000000004
[  847.229952] RBP: 00007fffc09eb010 R08: 431bde82d7b634db R09: 00000000ffffffff
[  847.231397] R10: 00000000ffffffff R11: 0000000000000293 R12: 00007fffc09eaec0
[  847.232823] R13: 0000000000000001 R14: ffffffffffffffff R15: 0000000000000000
[  847.234199] INFO: task rpcbind:2624 blocked for more than 120 seconds.
[  847.235669]       Tainted: P           O     4.15.18-4-pve #1
[  847.237042] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  847.242089] rpcbind         D    0  2624      1 0x00000000
[  847.244049] Call Trace:
[  847.245324]  __schedule+0x3e0/0x870
[  847.246883]  schedule+0x36/0x80
[  847.247957]  cv_wait_common+0x11e/0x140 [spl]
[  847.249063]  ? wait_woken+0x80/0x80
[  847.250169]  __cv_wait+0x15/0x20 [spl]
[  847.251423]  txg_wait_open+0xb0/0x100 [zfs]
[  847.252706]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  847.254264]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  847.255518]  zvol_write+0x175/0x620 [zfs]
[  847.256706]  zvol_request+0x24a/0x300 [zfs]
[  847.258092]  ? SyS_madvise+0xa20/0xa20
[  847.259682]  generic_make_request+0x123/0x2f0
[  847.261515]  submit_bio+0x73/0x140
[  847.263119]  ? submit_bio+0x73/0x140
[  847.264676]  ? get_swap_bio+0xcf/0x100
[  847.266047]  __swap_writepage+0x345/0x3b0
[  847.267622]  ? __frontswap_store+0x73/0x100
[  847.269369]  swap_writepage+0x34/0x90
[  847.270991]  pageout.isra.53+0x1e5/0x330
[  847.272710]  shrink_page_list+0x955/0xb70
[  847.274195]  shrink_inactive_list+0x256/0x5e0
[  847.275618]  ? next_arg+0x80/0x110
[  847.276909]  shrink_node_memcg+0x365/0x780
[  847.278354]  shrink_node+0xe1/0x310
[  847.279636]  ? shrink_node+0xe1/0x310
[  847.280971]  do_try_to_free_pages+0xef/0x360
[  847.282448]  try_to_free_pages+0xf2/0x1b0
[  847.283825]  __alloc_pages_slowpath+0x401/0xf10
[  847.285329]  ? __hrtimer_init+0xa0/0xa0
[  847.286574]  ? poll_freewait+0x4a/0xb0
[  847.287577]  __alloc_pages_nodemask+0x25b/0x280
[  847.288654]  alloc_pages_current+0x6a/0xe0
[  847.289718]  __page_cache_alloc+0x86/0x90
[  847.290707]  __do_page_cache_readahead+0x10e/0x2d0
[  847.291834]  ? radix_tree_lookup_slot+0x22/0x50
[  847.293072]  ? __intel_pmu_enable_all.constprop.19+0x4d/0x80
[  847.294357]  ? find_get_entry+0x1e/0x100
[  847.295449]  filemap_fault+0x571/0x740
[  847.296492]  ? filemap_fault+0x571/0x740
[  847.297612]  ? filemap_map_pages+0x180/0x380
[  847.298682]  ext4_filemap_fault+0x31/0x44
[  847.299784]  __do_fault+0x24/0xe3
[  847.300829]  __handle_mm_fault+0xcd7/0x11e0
[  847.301910]  handle_mm_fault+0xce/0x1b0
[  847.302926]  __do_page_fault+0x25e/0x500
[  847.303946]  ? ktime_get_ts64+0x51/0xf0
[  847.305059]  do_page_fault+0x2e/0xe0
[  847.306133]  ? async_page_fault+0x2f/0x50
[  847.307158]  do_async_page_fault+0x1a/0x80
[  847.308292]  async_page_fault+0x45/0x50
[  847.309415] RIP: 0033:0x7fc96419d660
[  847.310441] RSP: 002b:00007fff1212c788 EFLAGS: 00010246
[  847.311917] RAX: 0000000000000000 RBX: 0000000000000007 RCX: 00007fc96419d660
[  847.313598] RDX: 0000000000007530 RSI: 0000000000000007 RDI: 00007fff1212c9d0
[  847.315185] RBP: 0000000000000007 R08: 00000000000000c3 R09: 000000000000000b
[  847.316666] R10: 00007fff1212c700 R11: 0000000000000246 R12: 000000000000000c
[  847.318208] R13: 000000000000000c R14: 0000558af8c4d480 R15: 00007fff1212c810
[  847.319523] INFO: task systemd-logind:2759 blocked for more than 120 seconds.
[  847.320812]       Tainted: P           O     4.15.18-4-pve #1
[  847.322042] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  847.323441] systemd-logind  D    0  2759      1 0x00000120
[  847.324575] Call Trace:
[  847.325493]  __schedule+0x3e0/0x870
[  847.326482]  schedule+0x36/0x80
[  847.327482]  cv_wait_common+0x11e/0x140 [spl]
[  847.328683]  ? wait_woken+0x80/0x80
[  847.329778]  __cv_wait+0x15/0x20 [spl]
[  847.330970]  txg_wait_open+0xb0/0x100 [zfs]
[  847.332121]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  847.333402]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  847.334552]  zvol_write+0x175/0x620 [zfs]
[  847.335630]  zvol_request+0x24a/0x300 [zfs]
[  847.336841]  ? SyS_madvise+0xa20/0xa20
[  847.337971]  generic_make_request+0x123/0x2f0
[  847.339185]  submit_bio+0x73/0x140
[  847.340308]  ? submit_bio+0x73/0x140
[  847.341486]  ? get_swap_bio+0xcf/0x100
[  847.342568]  __swap_writepage+0x345/0x3b0
[  847.343733]  ? __frontswap_store+0x73/0x100
[  847.344923]  swap_writepage+0x34/0x90
[  847.345982]  pageout.isra.53+0x1e5/0x330
[  847.347115]  shrink_page_list+0x955/0xb70
[  847.348290]  shrink_inactive_list+0x256/0x5e0
[  847.349570]  ? virtqueue_add_sgs+0x3b0/0x490
[  847.350704]  ? next_arg+0x80/0x110
[  847.351733]  shrink_node_memcg+0x365/0x780
[  847.352902]  shrink_node+0xe1/0x310
[  847.353928]  ? shrink_node+0xe1/0x310
[  847.355200]  do_try_to_free_pages+0xef/0x360
[  847.356538]  try_to_free_pages+0xf2/0x1b0
[  847.357638]  __alloc_pages_slowpath+0x401/0xf10
[  847.358760]  ? __update_load_avg_se.isra.38+0x1bc/0x1d0
[  847.360134]  ? __switch_to_asm+0x40/0x70
[  847.361353]  ? __switch_to_asm+0x34/0x70
[  847.362495]  ? __switch_to_asm+0x34/0x70
[  847.363611]  ? __switch_to_asm+0x40/0x70
[  847.364668]  __alloc_pages_nodemask+0x25b/0x280
[  847.365826]  alloc_pages_vma+0x88/0x1c0
[  847.366881]  __read_swap_cache_async+0x147/0x200
[  847.368331]  read_swap_cache_async+0x2b/0x60
[  847.369577]  swapin_readahead+0x22f/0x2b0
[  847.370702]  ? radix_tree_lookup_slot+0x22/0x50
[  847.371913]  ? find_get_entry+0x1e/0x100
[  847.373118]  ? pagecache_get_page+0x2c/0x2b0
[  847.374463]  do_swap_page+0x52d/0x9b0
[  847.375610]  ? do_swap_page+0x52d/0x9b0
[  847.376802]  ? __wake_up_sync_key+0x1e/0x30
[  847.378066]  __handle_mm_fault+0x88d/0x11e0
[  847.379291]  handle_mm_fault+0xce/0x1b0
[  847.380417]  __do_page_fault+0x25e/0x500
[  847.381577]  ? __switch_to_asm+0x34/0x70
[  847.382513]  do_page_fault+0x2e/0xe0
[  847.383370]  do_async_page_fault+0x1a/0x80
[  847.384302]  async_page_fault+0x25/0x50
[  847.385225] RIP: 0010:ep_send_events_proc+0x120/0x1a0
[  847.386258] RSP: 0000:ffffad168108bd80 EFLAGS: 00010246
[  847.387335] RAX: 0000000000000001 RBX: ffffad168108bdf8 RCX: 00007fffea7798b0
[  847.388657] RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246
[  847.390106] RBP: ffffad168108bdd8 R08: 000000000000030a R09: ffff9d25f5dfc218
[  847.391442] R10: ffffad168108bcc0 R11: 0000000000000000 R12: 0000000000000000
[  847.392743] R13: ffffad168108be78 R14: ffff9d25faea20c0 R15: ffff9d25f5dfc218
[  847.394189]  ? ep_read_events_proc+0xd0/0xd0
[  847.395244]  ep_scan_ready_list.constprop.18+0x9e/0x210
[  847.396415]  ep_poll+0x1f8/0x3b0
[  847.397387]  ? wake_up_q+0x80/0x80
[  847.398331]  SyS_epoll_wait+0xce/0xf0
[  847.399313]  do_syscall_64+0x73/0x130
[  847.400232]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  847.401329] RIP: 0033:0x7fc20e6080a3
[  847.402292] RSP: 002b:00007fffea7798a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e8
[  847.403817] RAX: ffffffffffffffda RBX: 0000560a68ae1260 RCX: 00007fc20e6080a3
[  847.405311] RDX: 000000000000000d RSI: 00007fffea7798b0 RDI: 0000000000000004
[  847.406765] RBP: 00007fffea779a50 R08: 0000560a68aed6e0 R09: 0000000000000000
[  847.408354] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007fffea7798b0
[  847.410017] R13: 0000000000000001 R14: ffffffffffffffff R15: 00007fffea779b50
[  847.411383] INFO: task qemu-ga:2842 blocked for more than 120 seconds.
[  847.413009]       Tainted: P           O     4.15.18-4-pve #1
[  847.414513] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  847.416280] qemu-ga         D    0  2842      1 0x00000000
[  847.417779] Call Trace:
[  847.418769]  __schedule+0x3e0/0x870
[  847.419916]  schedule+0x36/0x80
[  847.421044]  cv_wait_common+0x11e/0x140 [spl]
[  847.422126]  ? wait_woken+0x80/0x80
[  847.423130]  __cv_wait+0x15/0x20 [spl]
[  847.424295]  zil_commit.part.14+0x86/0x8b0 [zfs]
[  847.425726]  ? spl_kmem_free+0x33/0x40 [spl]
[  847.427098]  ? zfs_range_unlock+0x1b3/0x2e0 [zfs]
[  847.428477]  zil_commit+0x17/0x20 [zfs]
[  847.429738]  zvol_write+0x5a2/0x620 [zfs]
[  847.430951]  ? avl_find+0x5f/0xa0 [zavl]
[  847.432198]  zvol_request+0x24a/0x300 [zfs]
[  847.433506]  ? SyS_madvise+0xa20/0xa20
[  847.434707]  generic_make_request+0x123/0x2f0
[  847.435984]  submit_bio+0x73/0x140
[  847.437152]  ? submit_bio+0x73/0x140
[  847.438330]  ? get_swap_bio+0xcf/0x100
[  847.439487]  __swap_writepage+0x345/0x3b0
[  847.440700]  ? __frontswap_store+0x73/0x100
[  847.441983]  swap_writepage+0x34/0x90
[  847.443175]  pageout.isra.53+0x1e5/0x330
[  847.444443]  shrink_page_list+0x955/0xb70
[  847.445809]  shrink_inactive_list+0x256/0x5e0
[  847.447055]  ? next_arg+0x80/0x110
[  847.448177]  shrink_node_memcg+0x365/0x780
[  847.449443]  shrink_node+0xe1/0x310
[  847.450563]  ? shrink_node+0xe1/0x310
[  847.451733]  do_try_to_free_pages+0xef/0x360
[  847.452995]  try_to_free_pages+0xf2/0x1b0
[  847.454159]  __alloc_pages_slowpath+0x401/0xf10
[  847.455441]  ? __page_cache_alloc+0x86/0x90
[  847.456679]  __alloc_pages_nodemask+0x25b/0x280
[  847.457968]  alloc_pages_current+0x6a/0xe0
[  847.459194]  __page_cache_alloc+0x86/0x90
[  847.460374]  filemap_fault+0x369/0x740
[  847.461542]  ? __switch_to_asm+0x40/0x70
[  847.462696]  ? __switch_to_asm+0x34/0x70
[  847.463843]  ? filemap_map_pages+0x180/0x380
[  847.465061]  ? __switch_to_asm+0x34/0x70
[  847.466211]  ext4_filemap_fault+0x31/0x44
[  847.467346]  __do_fault+0x24/0xe3
[  847.468398]  __handle_mm_fault+0xcd7/0x11e0
[  847.469631]  handle_mm_fault+0xce/0x1b0
[  847.470750]  __do_page_fault+0x25e/0x500
[  847.471897]  do_page_fault+0x2e/0xe0
[  847.472997]  ? async_page_fault+0x2f/0x50
[  847.474089]  do_async_page_fault+0x1a/0x80
[  847.475205]  async_page_fault+0x45/0x50
[  847.476285] RIP: 0033:0x55c573934927
[  847.477358] RSP: 002b:00007ffee4db6630 EFLAGS: 00010206
[  847.478611] RAX: 0000000000000000 RBX: 000055c575723e20 RCX: 00007ff976bbc270
[  847.480176] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007ffee4db6610
[  847.481798] RBP: 00007ffee4db6640 R08: 000055c57572a0c0 R09: 0000000000000000
[  847.483337] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c57572a9a0
[  847.484893] R13: 000055c57572a2a0 R14: 000055c57572ad60 R15: 00007ff97766b610
[  847.486492] INFO: task rrdcached:3267 blocked for more than 120 seconds.
[  847.487976]       Tainted: P           O     4.15.18-4-pve #1
[  847.489368] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  847.491033] rrdcached       D    0  3267      1 0x00000000
[  847.492429] Call Trace:
[  847.493452]  __schedule+0x3e0/0x870
[  847.494556]  schedule+0x36/0x80
[  847.495609]  cv_wait_common+0x11e/0x140 [spl]
[  847.496841]  ? wait_woken+0x80/0x80
[  847.497976]  __cv_wait+0x15/0x20 [spl]
[  847.499154]  txg_wait_open+0xb0/0x100 [zfs]
[  847.500355]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  847.501566]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  847.502758]  zvol_write+0x175/0x620 [zfs]
[  847.503976]  zvol_request+0x24a/0x300 [zfs]
[  847.505245]  ? SyS_madvise+0xa20/0xa20
[  847.506371]  generic_make_request+0x123/0x2f0
[  847.507549]  submit_bio+0x73/0x140
[  847.508608]  ? submit_bio+0x73/0x140
[  847.509801]  ? get_swap_bio+0xcf/0x100
[  847.510921]  __swap_writepage+0x345/0x3b0
[  847.512140]  ? __frontswap_store+0x73/0x100
[  847.513384]  swap_writepage+0x34/0x90
[  847.514525]  pageout.isra.53+0x1e5/0x330
[  847.515632]  shrink_page_list+0x955/0xb70
[  847.516772]  shrink_inactive_list+0x256/0x5e0
[  847.518015]  ? syscall_return_via_sysret+0x1a/0x7a
[  847.519363]  ? next_arg+0x80/0x110
[  847.520513]  shrink_node_memcg+0x365/0x780
[  847.521717]  shrink_node+0xe1/0x310
[  847.522846]  ? shrink_node+0xe1/0x310
[  847.523949]  do_try_to_free_pages+0xef/0x360
[  847.525182]  try_to_free_pages+0xf2/0x1b0
[  847.526436]  __alloc_pages_slowpath+0x401/0xf10
[  847.527735]  ? __page_cache_alloc+0x86/0x90
[  847.528986]  __alloc_pages_nodemask+0x25b/0x280
[  847.530283]  alloc_pages_current+0x6a/0xe0
[  847.531307]  __page_cache_alloc+0x86/0x90
[  847.532306]  filemap_fault+0x369/0x740
[  847.533308]  ? filemap_map_pages+0x27a/0x380
[  847.534323]  ext4_filemap_fault+0x31/0x44
[  847.535333]  __do_fault+0x24/0xe3
[  847.536304]  __handle_mm_fault+0xcd7/0x11e0
[  847.537400]  ? update_cfs_group+0xc4/0xe0
[  847.538423]  handle_mm_fault+0xce/0x1b0
[  847.539484]  __do_page_fault+0x25e/0x500
[  847.540584]  ? ktime_get_ts64+0x51/0xf0
[  847.541690]  do_page_fault+0x2e/0xe0
[  847.542714]  ? async_page_fault+0x2f/0x50
[  847.543785]  do_async_page_fault+0x1a/0x80
[  847.544885]  async_page_fault+0x45/0x50
[  847.545894] RIP: 0033:0x7f40f2d6667d
[  847.546870] RSP: 002b:00007fffcec77460 EFLAGS: 00010293
[  847.548075] RAX: 0000000000000000 RBX: 0000556934dc0c10 RCX: 00007f40f2d6667d
[  847.549599] RDX: 00000000000003e8 RSI: 0000000000000001 RDI: 0000556934dae6e0
[  847.551060] RBP: 0000000000000001 R08: 00007f40e9f24700 R09: 00007f40e9f24700
[  847.552554] R10: 00000000000008ba R11: 0000000000000293 R12: 0000556934c28360
[  847.554038] R13: 0000556934dae6e0 R14: 0000000000000000 R15: 0000000000000001

siv0 avatar Sep 04 '18 14:09 siv0

If possible, could you dump the stack from the txg_sync process when this happens?
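For example, one way to capture a single kernel thread's stack (a sketch; assumes root and a kernel exposing /proc/<pid>/stack):

cat /proc/$(pgrep -x txg_sync)/stack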

behlendorf avatar Sep 05 '18 00:09 behlendorf

Managed to get the following by running "sleep 60; echo t > /proc/sysrq-trigger" before starting stress:

[10981.102235] txg_sync        R  running task        0   606      2 0x80000000
[10981.103613] Call Trace:
[10981.104615]  <IRQ>
[10981.105989]  sched_show_task+0xfe/0x130
[10981.107636]  show_state_filter+0x62/0xd0
[10981.108730]  sysrq_handle_showstate+0x10/0x20
[10981.109813]  __handle_sysrq+0x106/0x170
[10981.110859]  sysrq_filter+0x9b/0x390
[10981.111841]  input_to_handler+0x61/0x100
[10981.112901]  input_pass_values.part.6+0x117/0x130
[10981.114128]  input_handle_event+0x137/0x510
[10981.115171]  input_event+0x54/0x80
[10981.116220]  atkbd_interrupt+0x4b6/0x7a0
[10981.117176]  serio_interrupt+0x4c/0x90
[10981.118211]  i8042_interrupt+0x1f4/0x3b0
[10981.119180]  __handle_irq_event_percpu+0x84/0x1a0
[10981.120324]  handle_irq_event_percpu+0x32/0x80
[10981.121878]  handle_irq_event+0x3b/0x60
[10981.123547]  handle_edge_irq+0x78/0x1a0
[10981.124862]  handle_irq+0x20/0x30
[10981.125899]  do_IRQ+0x4e/0xd0
[10981.126730]  common_interrupt+0x84/0x84
[10981.127697]  </IRQ>
[10981.128422] RIP: 0010:page_vma_mapped_walk+0x49e/0x770
[10981.129488] RSP: 0000:ffffc3310106f188 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffd6
[10981.130916] RAX: ffff9e7c795f6a80 RBX: ffffc3310106f1d8 RCX: ffff9e7c40000a80
[10981.132363] RDX: 0000000000000000 RSI: 00003ffffffff000 RDI: 0000000000000000
[10981.133810] RBP: ffffc3310106f1b8 R08: 0000000000000067 R09: 0000000000000000
[10981.135262] R10: 0000000000000000 R11: 0000000000000091 R12: ffffe7bd40949440
[10981.136754] R13: ffff9e7c79468cc0 R14: 0000000000000988 R15: ffffc3310106f1d8
[10981.138596]  ? page_vma_mapped_walk+0x204/0x770
[10981.140413]  page_referenced_one+0x91/0x190
[10981.141780]  rmap_walk_anon+0x113/0x270
[10981.142918]  rmap_walk+0x48/0x60
[10981.143998]  page_referenced+0x10d/0x170
[10981.145189]  ? rmap_walk_anon+0x270/0x270
[10981.146331]  ? page_get_anon_vma+0x80/0x80
[10981.147523]  shrink_page_list+0x791/0xb70
[10981.148737]  shrink_inactive_list+0x256/0x5e0
[10981.150007]  shrink_node_memcg+0x365/0x780
[10981.151191]  ? _cond_resched+0x1a/0x50
[10981.152316]  shrink_node+0xe1/0x310
[10981.153384]  ? shrink_node+0xe1/0x310
[10981.154739]  do_try_to_free_pages+0xef/0x360
[10981.155776]  try_to_free_pages+0xf2/0x1b0
[10981.156747]  __alloc_pages_slowpath+0x401/0xf10
[10981.157910]  __alloc_pages_nodemask+0x25b/0x280
[10981.159112]  alloc_pages_current+0x6a/0xe0
[10981.160480]  new_slab+0x317/0x690
[10981.161439]  ? __blk_mq_get_tag+0x21/0x90
[10981.162485]  ___slab_alloc+0x3c1/0x4e0
[10981.163458]  ? spl_kmem_zalloc+0xa4/0x190 [spl]
[10981.164520]  ? update_load_avg+0x665/0x700
[10981.165545]  ? update_load_avg+0x665/0x700
[10981.166545]  __slab_alloc+0x20/0x40
[10981.167472]  ? __enqueue_entity+0x5c/0x60
[10981.168405]  ? __slab_alloc+0x20/0x40
[10981.169297]  __kmalloc_node+0xe4/0x2b0
[10981.170362]  ? spl_kmem_zalloc+0xa4/0x190 [spl]
[10981.171331]  spl_kmem_zalloc+0xa4/0x190 [spl]
[10981.172380]  dbuf_dirty+0x20c/0x880 [zfs]
[10981.173440]  ? spl_kmem_zalloc+0xa4/0x190 [spl]
[10981.174606]  dbuf_dirty+0x74e/0x880 [zfs]
[10981.175719]  dnode_setdirty+0xbe/0x100 [zfs]
[10981.176841]  dbuf_dirty+0x46c/0x880 [zfs]
[10981.177809]  dmu_buf_will_dirty+0x11c/0x130 [zfs]
[10981.178814]  dsl_dataset_sync+0x26/0x230 [zfs]
[10981.179801]  dsl_pool_sync+0x9f/0x430 [zfs]
[10981.180747]  spa_sync+0x42d/0xd50 [zfs]
[10981.181653]  txg_sync_thread+0x2d4/0x4a0 [zfs]
[10981.182655]  ? finish_task_switch+0x74/0x200
[10981.183603]  ? txg_quiesce_thread+0x3f0/0x3f0 [zfs]
[10981.184578]  thread_generic_wrapper+0x74/0x90 [spl]
[10981.185537]  kthread+0x105/0x140
[10981.186386]  ? __thread_exit+0x20/0x20 [spl]
[10981.187333]  ? kthread_create_worker_on_cpu+0x70/0x70
[10981.188470]  ? do_syscall_64+0x73/0x130
[10981.189420]  ? kthread_create_worker_on_cpu+0x70/0x70
[10981.190450]  ret_from_fork+0x35/0x40

siv0 avatar Sep 05 '18 13:09 siv0

I decided to give this a try and, as advertised, the deadlock is quite easy to reproduce. First, in a 4.14 kernel, it's interesting to note that a bunch of the "page allocation stalls for ..." messages are generated. That particular message, however, was removed in later kernels because it was determined that merely trying to print the message could, itself, cause deadlocks.

In a 4.16 kernel, which does not have the message, the processes are in quite a few different states. Among other things, an interesting observation is that the "zvol" tasks are generally in a running (R) state. Most of the regular user processes are in various states of dormancy (D). This sshd process is typical ("?" lines elided for brevity):

[ 6212.296211] sshd            D    0  3018   2976 0x00000000
[ 6212.296211] Call Trace:
[ 6212.296211]  schedule+0x32/0x80
[ 6212.296211]  schedule_timeout+0x159/0x330
[ 6212.296211]  io_schedule_timeout+0x19/0x40
[ 6212.296211]  mempool_alloc+0x121/0x150
[ 6212.296211]  bvec_alloc+0x86/0xe0
[ 6212.296211]  bio_alloc_bioset+0x153/0x200
[ 6212.296211]  get_swap_bio+0x40/0xd0
[ 6212.296211]  __swap_writepage+0x292/0x360
[ 6212.296211]  pageout.isra.51+0x1bb/0x300
[ 6212.296211]  shrink_page_list+0x959/0xbd0
[ 6212.296211]  shrink_inactive_list+0x2a8/0x630
[ 6212.296211]  shrink_node_memcg+0x37f/0x7d0
[ 6212.296211]  shrink_node+0xcc/0x300
[ 6212.296211]  do_try_to_free_pages+0xdb/0x330
[ 6212.296211]  try_to_free_pages+0xd5/0x1a0
[ 6212.296211]  __alloc_pages_slowpath+0x37c/0xd80
[ 6212.296211]  __alloc_pages_nodemask+0x227/0x260
[ 6212.296211]  alloc_pages_vma+0x7c/0x1e0
[ 6212.296211]  __read_swap_cache_async+0x149/0x1c0
[ 6212.296211]  do_swap_page_readahead+0xc9/0x1a0
[ 6212.296211]  do_swap_page+0x5a5/0x820
[ 6212.296211]  __handle_mm_fault+0x91a/0x1040
[ 6212.296211]  handle_mm_fault+0xe4/0x1f0
[ 6212.296211]  __do_page_fault+0x242/0x4d0
[ 6212.296211]  do_page_fault+0x2c/0x110
[ 6212.296211]  page_fault+0x25/0x50

Other dormant user processes such as agetty all have pretty much identical stacks.

As far as ZFS is concerned, in the current deadlocked system I've got running, the only blocked process is "txg_quiesce" with a rather uninteresting stack of:

[ 5891.836341] txg_quiesce     D    0  2457      2 0x80000000
[ 5891.836341] Call Trace:
[ 5891.836341]  schedule+0x32/0x80
[ 5891.836341]  cv_wait_common+0x129/0x260 [spl]
[ 5891.836341]  txg_quiesce_thread+0x2fc/0x500 [zfs]
[ 5891.836341]  thread_generic_wrapper+0x76/0xb0 [spl]
[ 5891.836341]  kthread+0xf5/0x130
[ 5891.836341]  ret_from_fork+0x35/0x40

which is the cv_wait(&tc->tc_cv[g], &tc->tc_lock) call at the bottom of txg_quiesce().

The most interesting zfs-related thing is that all the zvol threads are spinning (in "R" state). They all appear to be in their while (uio.uio_resid > 0 ...) loop calling dmu_write_uio_dnode() and show this typical stack:

[ 5891.836341] zvol            R  running task        0  3034      2 0x80000000
[ 5891.836341] Call Trace:
[ 5891.836341]  ? page_referenced+0xb9/0x160
[ 5891.836341]  ? shrink_page_list+0x467/0xbd0
[ 5891.836341]  ? shrink_inactive_list+0x2b5/0x630
[ 5891.836341]  ? shrink_node_memcg+0x37f/0x7d0
[ 5891.836341]  ? __list_lru_init+0x170/0x200
[ 5891.836341]  ? shrink_node+0xcc/0x300
[ 5891.836341]  ? shrink_node+0xcc/0x300
[ 5891.836341]  ? do_try_to_free_pages+0xdb/0x330
[ 5891.836341]  ? try_to_free_pages+0xd5/0x1a0
[ 5891.836341]  ? __alloc_pages_slowpath+0x37c/0xd80
[ 5891.836341]  ? get_page_from_freelist+0x145/0x1190
[ 5891.836341]  ? __alloc_pages_nodemask+0x227/0x260
[ 5891.836341]  ? abd_alloc+0x1e4/0x4d0 [zfs]
[ 5891.836341]  ? arc_hdr_alloc_abd+0x5b/0x230 [zfs]
[ 5891.836341]  ? arc_hdr_alloc+0x161/0x390 [zfs]
[ 5891.836341]  ? arc_alloc_buf+0x38/0x110 [zfs]
[ 5891.836341]  ? dmu_buf_will_fill+0x122/0x3b0 [zfs]
[ 5891.836341]  ? dmu_write_uio_dnode+0xe1/0x1a0 [zfs]
[ 5891.836341]  ? zvol_write+0x191/0x6c0 [zfs]

I'm going to try a couple of things: First, set zvol_request_sync=1 to see what difference it might make. Second, add some instrumentation to the zvol path to see whether it might actually be making some forward progress, albeit very slowly.
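For reference, the first experiment amounts to toggling an existing module parameter (the path assumes the usual sysfs layout for the zfs module):

echo 1 > /sys/module/zfs/parameters/zvol_request_sync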

dweeezil avatar Sep 07 '18 16:09 dweeezil

Naively, I don't see how swap can be expected to work as things are; I think we should document/label this as such.

@behlendorf given the amount of code complexity and potentially allocations required is supporting swap even realistic?

cwedgwood avatar Sep 07 '18 16:09 cwedgwood

@cwedgwood It's definitely been a challenging area to get working and is not heavily stress tested on a wide range of kernels. It was working reliably with older zfs releases and older kernels, but, as you say, given the code complexity and the surface area we'd need to test, it's not something I'd recommend. I tend to agree we should make this clearer in the documentation.

behlendorf avatar Sep 07 '18 18:09 behlendorf

I am using this (swap on zfs) on several production systems (home and work) with recent kernels. Thankfully it has not been an issue for me yet.

drescherjm avatar Sep 07 '18 22:09 drescherjm

Same issue on Gentoo with ZFS 0.8.0-rc1 and native encryption.

inpos avatar Sep 17 '18 13:09 inpos

@inpos Out of curiosity, how did you get 0.8.0-rc1 on Gentoo? I have not finished my review of it, so I haven't pushed it to the main tree yet. I might just push it unkeyworded if people would otherwise resort to building outside the package manager.

ryao avatar Sep 17 '18 13:09 ryao

@behlendorf This was never 100% sane, although the return of the zvol threads certainly did not help. I started (but did not finish) writing a patch that could help with this:

https://paste.pound-python.org/show/KWlvrHdBU2mA9ev2odXL/

At this point, it is fairly clear to me that my offline obligations prevent me from speculating on if/when I will finish that, but other developers should be able to see the idea there. It ought to help if/when finished. The current version will NOT compile, so I would prefer it if users did not attempt it. Trying to compile it would be a waste of their time.

Alternatively, there is a nuclear option for resolving this. If we treat swap like Illumos treats dump devices by disabling CoW and checksums, things will definitely work. That would be at the expense of protection against bitrot on the swap. This requires modifications to the code because the current codebase is not able to support a zvol in that configuration. Quite honestly, if we were going for the nuclear option, I'd prefer to create a new "dataset" type and implement extent based allocation.

@ahrens Not that I am seriously considering this, but it would be nice to hear your thoughts on the nuclear option for dealing with swap.

ryao avatar Sep 17 '18 13:09 ryao

@ryao Here is the content of /usr/local/portage/sys-fs/zfs/zfs-0.8.0_rc1.ebuild

And here is the content of /usr/local/portage/sys-fs/zfs-kmod/zfs-kmod-0.8.0_rc1.ebuild

inpos avatar Sep 17 '18 14:09 inpos

On second thought, given that swap devices are simultaneously dump devices on Linux, it might make sense to implement support for creating/writing to Illumos dump devices. That would probably be a good middle ground here.

ryao avatar Sep 17 '18 14:09 ryao

I'm not an expert in this area of the code, but I think that swap on ZVOL is inherently unreliable due to writes to the swap ZVOL having to go through the normal TXG sync and ZIO write paths, which can require lots of memory allocations by design (and these memory allocations can stall due to a low memory situation). I believe this to be true for swap on ZVOL for illumos, as well as Linux, and presumably FreeBSD too (although I have no experience using it on FreeBSD, so I could be wrong).

I think the "proper" way to address this is to mimic the write path for a ZVOL dump device on illumos, for the ZVOL swap device. i.e. preallocate the ZVOL's blocks on disk, then do a "direct write" to the preallocated LBA such that it doesn't go through the normal TXG sync and ZIO write code paths. This would significantly reduce the amount of work (e.g. memory allocations) that's required to write a swap page to disk, thereby increasing the reliability of that write.

The drawback to this approach would be that we won't get the data consistency guarantees that we normally get with ZFS, e.g. data checksums. I think this is a reasonable tradeoff (i.e. swap zvol + no hangs + no checksums vs. swap zvol + hangs + checksums), given that other Linux filesystems are no better (right?).

I spent a couple hours reading the ZVOL dump device write codepaths during the OpenZFS dev summit last week, and I think this approach is viable. It'll require porting the code over to work on Linux, where we'll need to rework the code to work with the Linux block device layer, since (IIRC) that's what's used to issue the writes to disk (see zvol_dumpio_vdev() in illumos). Additionally, we'll need to determine how/where to do the preallocation of the ZVOL's blocks (see zvol_prealloc() in illumos); for illumos dump devices, this occurs via the DKIOCDUMPINIT ioctl (when zvol_dumpify() is called).

With all that said, I've only spent an hour or two looking into this and I don't have any prior experience with swap, so I may be overlooking something or flat out wrong in my analysis so far. Also, I won't have any time in the near future to look into this in detail, but I could potentially help answer questions if somebody else wants to try to prototype the changes I'm proposing.

prakashsurya avatar Sep 17 '18 16:09 prakashsurya

we'll need to determine how/where to do the preallocation of the ZVOL's blocks (see zvol_prealloc() in illumos);

I took a look at mkswap.c and, aside from writing a unique signature to the device, it doesn't do anything that would let ZFS easily identify it as a swap device. One solution would be to add a udev helper which detects ZFS volumes with the swap signature and calls zvol_prealloc() via an ioctl(). This should allow things to "just work" on most Linux distributions, though obviously it's Linux-specific. We should probably do something a bit more generic and reliable and add a new create-time-only property for volumes. This wouldn't even necessarily prevent us from adding the udev hook in the future.
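
For illustration only (the preallocation helper and ioctl described here do not exist yet, so this is just a hedged sketch of the detection half): the swap signature written by mkswap is easy to spot from user space with blkid, which is the same probe udev already uses to populate ID_FS_TYPE:

blkid -o value -s TYPE /dev/zvol/rpool/swap

This prints "swap" once mkswap has run (the rpool/swap path is only an example), so a udev rule matching zvol nodes with ID_FS_TYPE=="swap" could RUN+= a small helper that issues the hypothetical preallocation ioctl.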

behlendorf avatar Sep 17 '18 21:09 behlendorf

@behlendorf that's unfortunate; given there's no good entry point to do the preallocation, your two suggestions seem reasonable at first glance.

Using a new ioctl (or create-time property) to preallocate and "convert" a ZVOL to a "direct write" ZVOL might be a neat feature, as it could enable this use case of swap, but also other use cases where the user would like to forego the COW semantics (e.g. a zpool layered on top of ZVOLs).

prakashsurya avatar Sep 17 '18 21:09 prakashsurya

Maybe this wiki page

https://github.com/zfsonlinux/pkg-zfs/wiki/HOWTO-use-a-zvol-as-a-swap-device

should be edited to warn about this issue?

davidklaftenegger avatar Dec 07 '18 22:12 davidklaftenegger

I took the liberty of amending the wiki (I've been affected by this as well). It's quite reproducible in v0.6.5.9-5~bpo8+1 running on 3.16.0-7-amd64 (Debian 3.16.59-1), so I look forward to an eventual fix. I've found no sustainable workaround (other than to stop using zvols for swap).
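
For anyone needing the same stop-gap, taking the zvol out of service as swap is just (the device path is an example; remember to also remove the matching /etc/fstab entry):

swapoff /dev/zvol/rpool/swap

after which swap can live on a plain partition instead.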

MobyGamer avatar Jan 18 '19 15:01 MobyGamer

Thanks @MobyGamer, I'm running ZFS 0.7.12-1~bpo9+1 on Debian Stretch (kernel 4.19.12-1~bpo9+1) on a system with 2GB of RAM and I experienced this issue. I had followed the Debian Stretch Root on ZFS guide and was led to believe I should use swap on ZFS. Searching for the cause, there was really nothing in the wikis about potential downsides to doing it, and the only "against" I managed to find was this issue. It's good that there is at least some documentation regarding it now 😄.

mafredri avatar Jan 18 '19 18:01 mafredri

@mafredri I updated the wiki. Indeed, we hadn't seen problems with swap on the ZFS 0.7 branch for a while.

gmelikov avatar Jan 21 '19 09:01 gmelikov

FYI, this issue is what made us switch from a swap zvol to a dedicated partition at the last minute, after users' feedback and our own experience, in our installer on Ubuntu 19.10. (https://bugs.launchpad.net/bugs/1847628)

This is quite an annoying issue: for ext4, we have been using a swapfile for quite a long time to avoid extra partitioning on users' machines, and we would like our default ZFS experience to be similar.

didrocks avatar Oct 31 '19 09:10 didrocks

With multi_vdev_crash_dump it should be possible to swap to a preallocated zvol or file, with the writes bypassing ZFS completely, right? But it is not implemented yet.

scineram avatar Oct 31 '19 10:10 scineram

@didrocks I gave some information on a possible approach to address this long term, in my comment above. At this point, I think it's accepted that using a ZVOL for swap will result in instabilities, but we haven't had anybody with the time or motivation to step up and attempt to fix this.

I think the lack of motivation stems, at least partially, from the fact that (I assume) most folks using ZFS on Linux are not using it for the root filesystem (I use it personally, and so does my employer, so I don't mean to imply that nobody uses root on ZFS). Now that Ubuntu is officially supporting a root-filesystem-on-ZFS configuration, perhaps that will change, both in terms of more users of a root filesystem on ZFS and in terms of more developers with sufficient motivation to fix this issue.

prakashsurya avatar Oct 31 '19 16:10 prakashsurya

Alternatively, there is a nuclear option for resolving this. If we treat swap like Illumos treats dump devices by disabling CoW and checksums, things will definitely work. That would be at the expense of protection against bitrot on the swap.

and:

The drawback to this approach would be that we won't get the data consistency guarantees that we normally get with ZFS, e.g. data checksums. I think this is a reasonable tradeoff (i.e. swap zvol + no hangs + no checksums vs. swap zvol + hangs + checksums), given that other Linux filesystems are no better (right?).

Not that my 2¢ means much as a non-contributor, but I'd personally really prefer to see a design where at least checksums can be maintained functional so the possibility of end-to-end integrity can remain. Can anyone speak to the amount of complexity/effort that that single feature would add to this work?

I know Illumos dump devices were mentioned above. I am rather unfamiliar with Illumos, but I thought dump and swap devices were not the same thing, and that Illumos actually did use zvols for swap. Am I incorrect, or were the above comparisons not quite valid?

RSully avatar Nov 04 '19 23:11 RSully

I'd personally really prefer to see a design where at least checksums can be maintained functional

I think we'd all prefer this, but I don't think it's currently possible if we were to write directly to pre-allocated blocks, due to the checksums being stored in the indirect blocks, and the indirect blocks being written during the pre-allocation rather than when we write out the swap data.

and that Illumos actually did use zvols for swap.

Yes, we used ZVOLs for swap on our illumos based appliance, and we often saw (likely the same) deadlocks for workloads that tried to actually use the swap space.

prakashsurya avatar Nov 05 '19 00:11 prakashsurya

FYI, this issue is what made us switch from a swap zvol to a dedicated partition at the last minute, after users' feedback and our own experience, in our installer on Ubuntu 19.10. (https://bugs.launchpad.net/bugs/1847628)

This is quite an annoying issue: for ext4, we have been using a swapfile for quite a long time to avoid extra partitioning on users' machines, and we would like our default ZFS experience to be similar.

The pageout code is racing to get data out of memory by handing it to the block device, but ZFS is designed to write data asynchronously whenever possible. A problem occurs here because attempts to page out counterintuitively increase memory pressure in the short term (between I/O submission and txg commit): ZFS must do bookkeeping on the data handed to it before it is actually sent to disk as part of the txg commit.

Creating your swap zvols like this should fix the problem (of pageout to a swap zvol being counterproductive) by using the ZIL to bypass the txg commit and make the ARC buffers used to store the data evictable before the txg commit is done:

zfs create -o sync=always -o primarycache=metadata -o secondarycache=none -o logbias=throughput -b 4K -V 8G rpool/swap

That is suboptimal because single-queue-depth sync writeout performance with 4K IOs is fairly terrible. I recall seeing 10MB/sec on an Intel SSD 5 or 6 years ago. I have not been keeping track of its status, but I expect ZIL pipelining to improve the write-out performance of a swap zvol set up this way.
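
For anyone who wants to measure that on their own hardware, something along these lines approximates single-queue-depth 4K sync writes with fio (only a sketch; the target path and size are examples, and it must not point at a zvol that is currently in use as swap, since it overwrites the device):

fio --name=qd1-sync --filename=/dev/zvol/rpool/swaptest --ioengine=psync --rw=write --bs=4k --sync=1 --size=1G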

Someone probably should check whether we take advantage of the ability to do immediate eviction when writes are done to a zvol set up this way. There might still be a lag (although it is a much better situation than going with the default settings).

ryao avatar Nov 05 '19 07:11 ryao

I'd personally really prefer to see a design where at least checksums can be maintained functional

I think we'd all prefer this, but I don't think it's currently possible if we were to write directly to pre-allocated blocks, due to the checksums being stored in the indirect blocks, and the indirect blocks being written during the pre-allocation rather than when we write out the swap data.

The best sketch I have of a solution that can preserve checksums involves adding a special swap property on zvols that alters several behaviors:

  • Forces sync=always
  • Forces logbias=throughput
  • Forces primarycache=metadata and secondarycache=none if they are set to higher settings.
  • Ensures immediate eviction from ARC and L2ARC after the ZIOs complete.
  • Prints a warning to dmesg at creation / the first IO if the volblocksize does not match the system page size.
  • Marks the buffers allocated to process the writes as being allowed to pull from the system's emergency memory reserves, so that we can pass that flag through to the Linux memory allocator. Every buffer allocated with this marking must be immediately freed after the ZIOs complete. Nothing may be allowed to stick around.
  • Makes writes cut to the front of the line in the ZIO elevator.

A few bonus improvements would be:

  • Implementing ZIL pipelining (I have not looked at the status of that in a while)
  • Over-allocating space for the data referenced by the indirect ZIL records so that we can amortize metaslab allocations. This would be a step beyond ZIL pipelining in improving ZIL performance. I am not sure if it would be worthwhile, but it is an idea that occurred to me when thinking about this.

Some additional ideas that ignore suspend to disk (which is racy on a zvol, but that is for another issue):

  • Skipping ZIL replay at boot for zvols marked as swap.
  • Extending non-legacy mountpoints to mount swap zvols. This would let us mount them with discard. It would also let us wipe them clean at every mount.

This is a mix of my recollection of my past ideas on how to solve this with some new ones that occurred to me when (or shortly after) writing it. I had originally hoped that there might be a BIO flag that would avoid the need for an explicit property, but I am not aware of one. The ideas here probably need to be developed a bit more, but I believe that they would solve the problem.

That being said, without any code changes, the best that can be done to my knowledge is to force everything through ZIL like this:

zfs create -o sync=always -o primarycache=metadata -o secondarycache=none -o logbias=throughput -b 4K -V 8G rpool/swap

This should do most of the things on the first list.
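
Once such a zvol exists, hooking it up is the usual Linux procedure (paths are examples; the discard option ties in with the wipe-at-mount idea above):

mkswap -f /dev/zvol/rpool/swap
swapon --discard /dev/zvol/rpool/swap

or, persistently, an /etc/fstab line along the lines of /dev/zvol/rpool/swap none swap discard,defaults 0 0.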

ryao avatar Nov 05 '19 07:11 ryao

@ryao In what you're proposing, where we lean heavily on the ZIL to do the writes, we'll still need to do on-disk allocations "on demand" as we write out data to the swap ZVOL, right? Those on-disk allocations may require additional memory allocations (e.g. loading of metaslabs on heavily fragmented systems), which may be a cause for concern (e.g. hangs) when we're in a low (free) memory state. Forcing the use of the ZIL would likely improve the situation, but I'm not sure it would theoretically eliminate the problem.

prakashsurya avatar Nov 05 '19 15:11 prakashsurya

Using rsync locally to move data from one dataset to another (send|recv was not an option in this specific case), the machine froze completely after a while (Linux 5.3, zfs/master @ g177c79df). Limiting zfs_arc_max to prevent swapping at all helped.
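
(For reference, and with the 4 GiB value purely as an example: the cap can be applied at runtime via echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max, or persistently with options zfs zfs_arc_max=4294967296 in /etc/modprobe.d/zfs.conf.)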

That being said, without any code changes, the best that can be done to my knowledge is to force everything through ZIL like this:

zfs create -o sync=always -o primarycache=metadata -o secondarycache=none -o logbias=throughput -b 4K -V 8G rpool/swap

[Update] Correction: does not work here. See the post below.

However, what about compression on the swap zvol as stated on the zvol swap wiki page?

Just to make sure, I disabled compression on the swap zvol and re-created the swap fs.

Question: How would compression on the swap zvol influence the deadlock behavior?

spacerunner5 avatar Nov 06 '19 07:11 spacerunner5

Using rsync locally to move data from one dataset to another (send|recv was not an option in this specific case), the machine froze completely after a while (Linux 5.3, zfs/master @ g177c79df). Limiting zfs_arc_max to prevent swapping at all helped.

That being said, without any code changes, the best that can be done to my knowledge is to force everything through ZIL like this: zfs create -o sync=always -o primarycache=metadata -o secondarycache=none -o logbias=throughput -b 4K -V 8G rpool/swap

Setting the swap zvol to the parameters shown above works for me.

However, what about compression on the swap zvol as stated on the zvol swap wiki page?

Just to make sure, I disabled compression on the swap zvol and re-created the swap fs.

Question: How would compression on the swap zvol influence the deadlock behavior?

ZLE compression just does zero elimination. It is a clever idea, although it likely has no effect in practice because applications generally do not keep significant quantities of zeroed pages in their address spaces. Offhand, I cannot think of a way that it could possibly make things worse. Explicitly setting it could be an improvement by keeping the swap zvol from inheriting a compression setting from the root dataset that involves doing memory allocations.
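
For example, pinning the setting on the volume itself (the dataset name is an example) avoids any surprise inheritance:

zfs set compression=zle rpool/swap
zfs get compression rpool/swap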

ryao avatar Nov 06 '19 11:11 ryao