
100% CPU load from arc_prune

Open dmaziuk opened this issue 5 years ago • 80 comments

System information

Type                  Version/Name
Distribution Name     centos
Distribution Version  CentOS release 6.10 (Final)
Linux Kernel          Linux hereland.nmrfam.wisc.edu 2.6.32-754.25.1.el6.x86_64 #1 SMP Mon Dec 23 15:19:53 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Architecture          AMD 64
ZFS Version           0.8.3-1
SPL Version           0.8.3-1

Describe the problem you're observing

While moving 14 TB of external data into a ZFS dataset, after a couple of days the CPU is at 99% system load. Clients time out trying to mount the dataset (exported via sharenfs).

Describe how to reproduce the problem

This is the first time it's happened for me; no idea if it's reproducible.

Include any warning/errors/backtraces from the system logs

top - 09:49:18 up 3 days, 21:38,  3 users,  load average: 48.27, 47.56, 47.73
Tasks: 653 total,  31 running, 622 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us, 99.0%sy,  0.0%ni,  1.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  32993392k total, 28381572k used,  4611820k free,    66360k buffers
Swap: 65535996k total,    57400k used, 65478596k free,  1500656k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20905 root      20   0     0    0    0 R 98.7  0.0 167:51.80 arc_prune
 7821 root      20   0     0    0    0 R 98.4  0.0 212:08.44 arc_prune
 8292 root      20   0     0    0    0 R 98.4  0.0  70:27.30 arc_prune
20589 root      20   0     0    0    0 R 98.4  0.0 168:11.97 arc_prune
22506 root      20   0     0    0    0 R 98.4  0.0 166:39.25 arc_prune
 1772 root      20   0     0    0    0 R 98.1  0.0 475:47.11 arc_prune
 8502 root      20   0     0    0    0 R 98.1  0.0 212:09.17 arc_prune
18799 root      20   0     0    0    0 R 98.1  0.0 169:55.35 arc_prune
19397 root      20   0     0    0    0 R 98.1  0.0 169:08.63 arc_prune
20312 root      20   0     0    0    0 R 98.1  0.0 168:14.17 arc_prune
23623 root      20   0     0    0    0 R 98.1  0.0 164:48.99 arc_prune
28861 root      20   0     0    0    0 R 98.1  0.0 137:44.49 arc_prune
21294 root      20   0     0    0    0 R 97.8  0.0 167:55.44 arc_prune
19192 root      20   0     0    0    0 R 96.5  0.0   4:46.02 arc_prune
23087 root      20   0     0    0    0 R 95.8  0.0 165:28.73 arc_prune
19802 root      20   0     0    0    0 R 94.9  0.0 168:58.81 arc_prune
 1773 root      39  19     0    0    0 R  6.1  0.0  84:38.77 zthr_procedure
19929 root      20   0 15432 1724  944 R  1.9  0.0   0:00.50 top

dmaziuk avatar Feb 07 '20 16:02 dmaziuk

More info:

# zpool status
  pool: backups
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:10:52 with 0 errors on Sat Feb  1 00:10:54 2020
config:

        NAME                        STATE     READ WRITE CKSUM
        backups                     ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            scsi-35000c500860d0c4b  ONLINE       0     0     0
            scsi-35000c500860d1543  ONLINE       0     0     0
            scsi-35000c500860f9437  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 0 days 21:22:47 with 0 errors on Sat Feb  1 21:22:49 2020
config:

        NAME                              STATE     READ WRITE CKSUM
        tank                              ONLINE       0     0     0
          raidz1-0                        ONLINE       0     0     0
            scsi-35000c500860ba403        ONLINE       0     0     0
            scsi-35000c500860bbb4f        ONLINE       0     0     0
            scsi-35000c500860bbb5f        ONLINE       0     0     0
          raidz1-2                        ONLINE       0     0     0
            scsi-35000c500860bcb77        ONLINE       0     0     0
            scsi-35000c500860cf9d7        ONLINE       0     0     0
            scsi-35000c500860d0b67        ONLINE       0     0     0
          raidz1-3                        ONLINE       0     0     0
            scsi-35000c500860f96db        ONLINE       0     0     0
            scsi-35000c500860fc85f        ONLINE       0     0     0
            scsi-35000c500860fd93b        ONLINE       0     0     0
        logs
          mirror-1                        ONLINE       0     0     0
            scsi-35000c500301bf47f-part1  ONLINE       0     0     0
            scsi-35000c500301bf69f-part1  ONLINE       0     0     0
        cache
          scsi-35000c500301bf6c3-part1    ONLINE       0     0     0

errors: No known data errors

# zpool iostat
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
backups      402G  21.4T      1      0  6.71K   204K
tank        31.9T  33.5T    783    217  33.8M  26.7M
----------  -----  -----  -----  -----  -----  -----

ARC_MAX is not set, it's left at the default. Cache SSD is 160GB. All disks are dual-ported SAS drives in an external shelf.
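
For reference, something like the following should confirm those defaults (assuming the stock 0.8.x layout; a zfs_arc_max of 0 means the built-in default of roughly half of RAM):

cat /sys/module/zfs/parameters/zfs_arc_max
grep -E '^(c_max|arc_meta_limit) ' /proc/spl/kstat/zfs/arcstats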

dmaziuk avatar Feb 07 '20 16:02 dmaziuk

Even more info:

...
Feb  7 04:55:40 hereland zed[6190]: Invoking "all-syslog.sh" eid=180 pid=16962
Feb  7 04:55:40 hereland zed: eid=180 class=history_event pool_guid=0x68C093597CA7900F
Feb  7 04:55:40 hereland zed[6190]: Finished "all-syslog.sh" eid=180 pid=16962 exit=0
Feb  7 04:55:41 hereland zed[6190]: Invoking "all-syslog.sh" eid=181 pid=17099
Feb  7 04:55:41 hereland zed: eid=181 class=history_event pool_guid=0x68C093597CA7900F
Feb  7 04:55:41 hereland zed[6190]: Finished "all-syslog.sh" eid=181 pid=17099 exit=0
Feb  7 06:14:42 hereland kernel: INFO: task nfsd:6109 blocked for more than 120 seconds.
...

There's nothing out of the ordinary in syslog until the "hung task" nfsd.

dmaziuk avatar Feb 07 '20 16:02 dmaziuk

Does it look like I'm getting hit by the "don't make the cache disk bigger than 5 x 1/2 RAM" problem? I thought that was no longer a thing.

dmaziuk avatar Feb 07 '20 16:02 dmaziuk

@dmaziuk that's a possibility. Can you post the contents of the /proc/spl/kstat/zfs/arcstats file? It should give us a better idea of why the system is trying so aggressively to reclaim from the ARC.

behlendorf avatar Feb 07 '20 17:02 behlendorf

# cat /proc/spl/kstat/zfs/arcstats
12 1 0x01 98 26656 14203091063 342692306280336
name                            type data
hits                            4    1961861302
misses                          4    214595515
demand_data_hits                4    535849124
demand_data_misses              4    21425564
demand_metadata_hits            4    1310503883
demand_metadata_misses          4    29241451
prefetch_data_hits              4    1327
prefetch_data_misses            4    146730344
prefetch_metadata_hits          4    115506968
prefetch_metadata_misses        4    17198156
mru_hits                        4    760224963
mru_ghost_hits                  4    1685841
mfu_hits                        4    1094334124
mfu_ghost_hits                  4    1919548
deleted                         4    294724177
mutex_miss                      4    24519247
access_skip                     4    701
evict_skip                      4    554617006
evict_not_enough                4    75691673
evict_l2_cached                 4    8277499373056
evict_l2_eligible               4    14072033056256
evict_l2_ineligible             4    9966615764992
evict_l2_skip                   4    4031567
hash_elements                   4    2149096
hash_elements_max               4    6808116
hash_collisions                 4    222109741
hash_chains                     4    394313
hash_chain_max                  4    11
p                               4    15524174769
c                               4    16823934976
c_min                           4    1051495936
c_max                           4    16823934976
size                            4    16951797592
compressed_size                 4    1377493504
uncompressed_size               4    4542341632
overhead_size                   4    4540340224
hdr_size                        4    101469040
data_size                       4    742869504
metadata_size                   4    5174964224
dbuf_size                       4    2256648576
dnode_size                      4    6258464128
bonus_size                      4    2234942720
anon_size                       4    705024
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    1051861504
mru_evictable_data              4    0
mru_evictable_metadata          4    8192
mru_ghost_size                  4    0
mru_ghost_evictable_data        4    0
mru_ghost_evictable_metadata    4    0
mfu_size                        4    4865267200
mfu_evictable_data              4    0
mfu_evictable_metadata          4    0
mfu_ghost_size                  4    0
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    0
l2_hits                         4    3173912
l2_misses                       4    211421480
l2_feeds                        4    786569
l2_rw_clash                     4    16
l2_read_bytes                   4    11567827968
l2_write_bytes                  4    4970878013440
l2_writes_sent                  4    689715
l2_writes_done                  4    689715
l2_writes_error                 4    0
l2_writes_lock_retry            4    1425
l2_evict_lock_retry             4    1161
l2_evict_reading                4    37
l2_evict_l1cached               4    1921025
l2_free_on_write                4    385458
l2_abort_lowmem                 4    119
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    182019350528
l2_asize                        4    140228637696
l2_hdr_size                     4    182439072
memory_throttle_count           4    0
memory_direct_count             4    143
memory_indirect_count           4    1174580
memory_all_bytes                4    33785233408
memory_free_bytes               4    7054409728
memory_available_bytes          3    6528663552
arc_no_grow                     4    0
arc_tempreserve                 4    9216
arc_loaned_bytes                4    0
arc_prune                       4    13876726
arc_meta_used                   4    16208928088
arc_meta_limit                  4    12617951232
arc_dnode_limit                 4    1261795123
arc_meta_max                    4    16893738168
arc_meta_min                    4    16777216
async_upgrade_sync              4    1740388
demand_hit_predictive_prefetch  4    67124685
demand_hit_prescient_prefetch   4    83590118
arc_need_free                   4    0
arc_sys_free                    4    525747968
arc_raw_size                    4    0

dmaziuk avatar Feb 07 '20 17:02 dmaziuk

System load is down to 95-ish% now BTW (but nfsd is still hosed)

dmaziuk avatar Feb 07 '20 17:02 dmaziuk

According to the arcstats, the ARC is entirely filled with metadata and is over the target (arc_meta_used > arc_meta_limit). The arc_prune threads are trying to evict some of this metadata to get down to the 75% target value.

arc_meta_used                   4    16208928088
arc_meta_limit                  4    12617951232

They appear to be unable to make any progress, which is why they're spinning. If I recall correctly, this was caused by the file handles cached by nfsd holding a reference, which prevents the ARC from freeing the buffers. I believe the nfsd behavior here is a little different, and better behaved, in newer kernels.

You should be able to stop the spinning by setting the limit to 100%. You can make this the default with the zfs_arc_meta_limit_percent=100 module option. But you'll need to reload the ZFS modules for that to take effect. To change the limit immediately, you can set zfs_arc_meta_limit to the arc_meta_max.

echo 16893738168 >/sys/module/zfs/parameters/zfs_arc_meta_limit
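
To make the 100% limit persist across module reloads, a minimal /etc/modprobe.d sketch using standard modprobe option syntax:

# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_meta_limit_percent=100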

behlendorf avatar Feb 07 '20 18:02 behlendorf

echo 16893738168 >/sys/module/zfs/parameters/zfs_arc_meta_limit

Done, arc_meta_used is going down now, but very slowly, with occasional dips in system load down to 80%-ish.

@behlendorf if you need anything else from this system, say so now, because my users are crying and I have to bounce it.

Would it be possible to make it not use all available cores? It's a 16-core system and it has 16 instances of arc_prune. I wonder if it should be limited to e.g. half the cores, the way the ARC is limited to half the RAM by default.

dmaziuk avatar Feb 07 '20 18:02 dmaziuk

Feel free to bounce the system; the arcstats are enough to confirm the issue. This behavior definitely can, and should, be improved, and we have some ideas about how to do that.

behlendorf avatar Feb 07 '20 19:02 behlendorf

Is this from the RAM ARC or the cache-SSD ARC? I.e., if it's the latter, can I get around it by disconnecting the cache device, temporarily, while moving a zillion files around?
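
For the record, an L2ARC device can be detached and re-attached online; a sketch using the cache device from the zpool status above:

zpool remove tank scsi-35000c500301bf6c3-part1
zpool add tank cache scsi-35000c500301bf6c3-part1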

dmaziuk avatar Feb 07 '20 19:02 dmaziuk

It's the RAM ARC; disconnecting the cache device won't help. Setting zfs_arc_meta_limit_percent=100 should avoid the spinning.

behlendorf avatar Feb 07 '20 20:02 behlendorf

OK, thank you. I did set it in /etc/modprobe.d/zfs.conf, though I really don't expect to hit this again. Unless another group moves in with 8 years' worth of data that they think they want on our "big" file server.

Thanks again

dmaziuk avatar Feb 07 '20 20:02 dmaziuk

To change the limit immediately, you can set zfs_arc_meta_limit to the arc_meta_max

@behlendorf thanks for providing the hint, it made the spinning go away for me as well! Below are the parameters/statistics from my system when the spinning occurred (it happened during send/receive to/from local pools).
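
For anyone else applying the same workaround, a one-liner sketch (copies the current arc_meta_max from arcstats into zfs_arc_meta_limit; run as root):

awk '$1 == "arc_meta_max" {print $3}' /proc/spl/kstat/zfs/arcstats > /sys/module/zfs/parameters/zfs_arc_meta_limit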

How do you plan to solve this by default in the future, without needing user intervention?

root@zfs:~# systool -vm spl
Module = "spl"

  Attributes:
    coresize            = "106496"
    initsize            = "0"
    initstate           = "live"
    refcnt              = "5"
    srcversion          = "09B6CA3EE9E5F5E76F82D5C"
    taint               = "OE"
    uevent              = <store method only>
    version             = "0.8.3-1~rlaager1.18.04.1"

  Parameters:
    spl_hostid          = "0"
    spl_hostid_path     = "/etc/hostid"
    spl_kmem_alloc_max  = "1048576"
    spl_kmem_alloc_warn = "65536"
    spl_kmem_cache_expire= "2"
    spl_kmem_cache_kmem_limit= "2048"
    spl_kmem_cache_kmem_threads= "4"
    spl_kmem_cache_magazine_size= "0"
    spl_kmem_cache_max_size= "32"
    spl_kmem_cache_obj_per_slab_min= "1"
    spl_kmem_cache_obj_per_slab= "8"
    spl_kmem_cache_reclaim= "0"
    spl_kmem_cache_slab_limit= "16384"
    spl_max_show_tasks  = "512"
    spl_panic_halt      = "0"
    spl_schedule_hrtimeout_slack_us= "0"
    spl_taskq_kick      = "0"
    spl_taskq_thread_bind= "0"
    spl_taskq_thread_dynamic= "1"
    spl_taskq_thread_priority= "1"
    spl_taskq_thread_sequential= "4"

  Sections:
    .altinstr_replacement= "0xffffffffc049b795"
    .altinstructions    = "0xffffffffc049f082"
    .bss                = "0xffffffffc04a0ec0"
    .data               = "0xffffffffc04a0000"
    .exit.text          = "0xffffffffc049b762"
    .gnu.linkonce.this_module= "0xffffffffc04a0b80"
    .init.text          = "0xffffffffc04ac000"
    .note.gnu.build-id  = "0xffffffffc049c000"
    .parainstructions   = "0xffffffffc049d358"
    .rodata             = "0xffffffffc049c820"
    .rodata.str1.1      = "0xffffffffc049cdc6"
    .rodata.str1.8      = "0xffffffffc049e918"
    .smp_locks          = "0xffffffffc049d2c4"
    .strtab             = "0xffffffffc04b2160"
    .symtab             = "0xffffffffc04ad000"
    .text               = "0xffffffffc0491000"
    __bug_table         = "0xffffffffc04a0b60"
    __ksymtab           = "0xffffffffc049c030"
    __ksymtab_strings   = "0xffffffffc049d7d4"
    __mcount_loc        = "0xffffffffc049e238"
    __param             = "0xffffffffc049def0"

root@zfs:~# systool -vm zfs
Module = "zfs"

  Attributes:
    coresize            = "3854336"
    initsize            = "0"
    initstate           = "live"
    refcnt              = "38"
    srcversion          = "C6177AA5049CC30B672B1CA"
    taint               = "POE"
    uevent              = <store method only>
    version             = "0.8.3-1~rlaager1.18.04.1"

  Parameters:
    dbuf_cache_hiwater_pct= "10"
    dbuf_cache_lowater_pct= "10"
    dbuf_cache_max_bytes= "33554432"
    dbuf_cache_shift    = "5"
    dbuf_metadata_cache_max_bytes= "16777216"
    dbuf_metadata_cache_shift= "6"
    dmu_object_alloc_chunk_shift= "7"
    dmu_prefetch_max    = "134217728"
    ignore_hole_birth   = "1"
    l2arc_feed_again    = "1"
    l2arc_feed_min_ms   = "200"
    l2arc_feed_secs     = "1"
    l2arc_headroom      = "2"
    l2arc_headroom_boost= "200"
    l2arc_noprefetch    = "1"
    l2arc_norw          = "0"
    l2arc_write_boost   = "8388608"
    l2arc_write_max     = "8388608"
    metaslab_aliquot    = "524288"
    metaslab_bias_enabled= "1"
    metaslab_debug_load = "0"
    metaslab_debug_unload= "0"
    metaslab_df_max_search= "16777216"
    metaslab_df_use_largest_segment= "0"
    metaslab_force_ganging= "16777217"
    metaslab_fragmentation_factor_enabled= "1"
    metaslab_lba_weighting_enabled= "1"
    metaslab_preload_enabled= "1"
    send_holes_without_birth_time= "1"
    spa_asize_inflation = "24"
    spa_config_path     = "/etc/zfs/zpool.cache"
    spa_load_print_vdev_tree= "0"
    spa_load_verify_data= "1"
    spa_load_verify_metadata= "1"
    spa_load_verify_shift= "4"
    spa_slop_shift      = "5"
    vdev_removal_max_span= "32768"
    vdev_validate_skip  = "0"
    zap_iterate_prefetch= "1"
    zfetch_array_rd_sz  = "1048576"
    zfetch_max_distance = "8388608"
    zfetch_max_streams  = "8"
    zfetch_min_sec_reap = "2"
    zfs_abd_scatter_enabled= "1"
    zfs_abd_scatter_max_order= "10"
    zfs_abd_scatter_min_size= "1536"
    zfs_admin_snapshot  = "0"
    zfs_arc_average_blocksize= "8192"
    zfs_arc_dnode_limit = "0"
    zfs_arc_dnode_limit_percent= "10"
    zfs_arc_dnode_reduce_percent= "10"
    zfs_arc_grow_retry  = "0"
    zfs_arc_lotsfree_percent= "10"
    zfs_arc_max         = "1073741824"
    zfs_arc_meta_adjust_restarts= "4096"
    zfs_arc_meta_limit_percent= "75"
    zfs_arc_meta_limit  = "0"
    zfs_arc_meta_min    = "0"
    zfs_arc_meta_prune  = "10000"
    zfs_arc_meta_strategy= "1"
    zfs_arc_min_prefetch_ms= "0"
    zfs_arc_min_prescient_prefetch_ms= "0"
    zfs_arc_min         = "0"
    zfs_arc_p_dampener_disable= "1"
    zfs_arc_p_min_shift = "0"
    zfs_arc_pc_percent  = "0"
    zfs_arc_shrink_shift= "0"
    zfs_arc_sys_free    = "0"
    zfs_async_block_max_blocks= "100000"
    zfs_autoimport_disable= "1"
    zfs_checksum_events_per_second= "20"
    zfs_commit_timeout_pct= "5"
    zfs_compressed_arc_enabled= "1"
    zfs_condense_indirect_commit_entry_delay_ms= "0"
    zfs_condense_indirect_vdevs_enable= "1"
    zfs_condense_max_obsolete_bytes= "1073741824"
    zfs_condense_min_mapping_bytes= "131072"
    zfs_dbgmsg_enable   = "1"
    zfs_dbgmsg_maxsize  = "4194304"
    zfs_dbuf_state_index= "0"
    zfs_ddt_data_is_special= "1"
    zfs_deadman_checktime_ms= "60000"
    zfs_deadman_enabled = "1"
    zfs_deadman_failmode= "wait"
    zfs_deadman_synctime_ms= "600000"
    zfs_deadman_ziotime_ms= "300000"
    zfs_dedup_prefetch  = "0"
    zfs_delay_min_dirty_percent= "60"
    zfs_delay_scale     = "500000"
    zfs_delete_blocks   = "20480"
    zfs_dirty_data_max_max= "2020201472"
    zfs_dirty_data_max_max_percent= "25"
    zfs_dirty_data_max_percent= "10"
    zfs_dirty_data_max  = "808080588"
    zfs_dirty_data_sync_percent= "20"
    zfs_disable_ivset_guid_check= "0"
    zfs_dmu_offset_next_sync= "0"
    zfs_expire_snapshot = "300"
    zfs_flags           = "0"
    zfs_free_bpobj_enabled= "1"
    zfs_free_leak_on_eio= "0"
    zfs_free_min_time_ms= "1000"
    zfs_immediate_write_sz= "32768"
    zfs_initialize_value= "16045690984833335022"
    zfs_key_max_salt_uses= "400000000"
    zfs_lua_max_instrlimit= "100000000"
    zfs_lua_max_memlimit= "104857600"
    zfs_max_missing_tvds= "0"
    zfs_max_recordsize  = "1048576"
    zfs_metaslab_fragmentation_threshold= "70"
    zfs_metaslab_segment_weight_enabled= "1"
    zfs_metaslab_switch_threshold= "2"
    zfs_mg_fragmentation_threshold= "95"
    zfs_mg_noalloc_threshold= "0"
    zfs_multihost_fail_intervals= "10"
    zfs_multihost_history= "0"
    zfs_multihost_import_intervals= "20"
    zfs_multihost_interval= "1000"
    zfs_multilist_num_sublists= "0"
    zfs_no_scrub_io     = "0"
    zfs_no_scrub_prefetch= "0"
    zfs_nocacheflush    = "0"
    zfs_nopwrite_enabled= "1"
    zfs_object_mutex_size= "64"
    zfs_obsolete_min_time_ms= "500"
    zfs_override_estimate_recordsize= "0"
    zfs_pd_bytes_max    = "52428800"
    zfs_per_txg_dirty_frees_percent= "5"
    zfs_prefetch_disable= "0"
    zfs_read_chunk_size = "1048576"
    zfs_read_history    = "0"
    zfs_read_history_hits= "0"
    zfs_reconstruct_indirect_combinations_max= "4096"
    zfs_recover         = "0"
    zfs_recv_queue_length= "16777216"
    zfs_removal_ignore_errors= "0"
    zfs_removal_suspend_progress= "0"
    zfs_remove_max_segment= "16777216"
    zfs_resilver_disable_defer= "0"
    zfs_resilver_min_time_ms= "3000"
    zfs_scan_checkpoint_intval= "7200"
    zfs_scan_fill_weight= "3"
    zfs_scan_ignore_errors= "0"
    zfs_scan_issue_strategy= "0"
    zfs_scan_legacy     = "0"
    zfs_scan_max_ext_gap= "2097152"
    zfs_scan_mem_lim_fact= "20"
    zfs_scan_mem_lim_soft_fact= "20"
    zfs_scan_strict_mem_lim= "0"
    zfs_scan_suspend_progress= "0"
    zfs_scan_vdev_limit = "4194304"
    zfs_scrub_min_time_ms= "1000"
    zfs_send_corrupt_data= "0"
    zfs_send_queue_length= "16777216"
    zfs_send_unmodified_spill_blocks= "1"
    zfs_slow_io_events_per_second= "20"
    zfs_spa_discard_memory_limit= "16777216"
    zfs_special_class_metadata_reserve_pct= "25"
    zfs_sync_pass_deferred_free= "2"
    zfs_sync_pass_dont_compress= "8"
    zfs_sync_pass_rewrite= "2"
    zfs_sync_taskq_batch_pct= "75"
    zfs_trim_extent_bytes_max= "134217728"
    zfs_trim_extent_bytes_min= "32768"
    zfs_trim_metaslab_skip= "0"
    zfs_trim_queue_limit= "10"
    zfs_trim_txg_batch  = "32"
    zfs_txg_history     = "100"
    zfs_txg_timeout     = "5"
    zfs_unlink_suspend_progress= "0"
    zfs_user_indirect_is_special= "1"
    zfs_vdev_aggregate_trim= "0"
    zfs_vdev_aggregation_limit_non_rotating= "131072"
    zfs_vdev_aggregation_limit= "1048576"
    zfs_vdev_async_read_max_active= "3"
    zfs_vdev_async_read_min_active= "1"
    zfs_vdev_async_write_active_max_dirty_percent= "60"
    zfs_vdev_async_write_active_min_dirty_percent= "30"
    zfs_vdev_async_write_max_active= "10"
    zfs_vdev_async_write_min_active= "2"
    zfs_vdev_cache_bshift= "16"
    zfs_vdev_cache_max  = "16384"
    zfs_vdev_cache_size = "0"
    zfs_vdev_default_ms_count= "200"
    zfs_vdev_initializing_max_active= "1"
    zfs_vdev_initializing_min_active= "1"
    zfs_vdev_max_active = "1000"
    zfs_vdev_min_ms_count= "16"
    zfs_vdev_mirror_non_rotating_inc= "0"
    zfs_vdev_mirror_non_rotating_seek_inc= "1"
    zfs_vdev_mirror_rotating_inc= "0"
    zfs_vdev_mirror_rotating_seek_inc= "5"
    zfs_vdev_mirror_rotating_seek_offset= "1048576"
    zfs_vdev_ms_count_limit= "131072"
    zfs_vdev_queue_depth_pct= "1000"
    zfs_vdev_raidz_impl = "cycle [fastest] original scalar sse2 ssse3 "
    zfs_vdev_read_gap_limit= "32768"
    zfs_vdev_removal_max_active= "2"
    zfs_vdev_removal_min_active= "1"
    zfs_vdev_scheduler  = "unused"
    zfs_vdev_scrub_max_active= "2"
    zfs_vdev_scrub_min_active= "1"
    zfs_vdev_sync_read_max_active= "10"
    zfs_vdev_sync_read_min_active= "10"
    zfs_vdev_sync_write_max_active= "10"
    zfs_vdev_sync_write_min_active= "10"
    zfs_vdev_trim_max_active= "2"
    zfs_vdev_trim_min_active= "1"
    zfs_vdev_write_gap_limit= "4096"
    zfs_zevent_cols     = "80"
    zfs_zevent_console  = "0"
    zfs_zevent_len_max  = "512"
    zfs_zil_clean_taskq_maxalloc= "1048576"
    zfs_zil_clean_taskq_minalloc= "1024"
    zfs_zil_clean_taskq_nthr_pct= "100"
    zil_maxblocksize    = "131072"
    zil_nocacheflush    = "0"
    zil_replay_disable  = "0"
    zil_slog_bulk       = "786432"
    zio_deadman_log_all = "0"
    zio_dva_throttle_enabled= "1"
    zio_requeue_io_start_cut_in_line= "1"
    zio_slow_io_ms      = "30000"
    zio_taskq_batch_pct = "75"
    zvol_inhibit_dev    = "0"
    zvol_major          = "230"
    zvol_max_discard_blocks= "16384"
    zvol_prefetch_bytes = "131072"
    zvol_request_sync   = "0"
    zvol_threads        = "32"
    zvol_volmode        = "1"

  Sections:
    .bss                = "0xffffffffc07d2100"
    .data               = "0xffffffffc07c0060"
    .exit.text          = "0xffffffffc07738c9"
    .gnu.linkonce.this_module= "0xffffffffc07d1dc0"
    .init.text          = "0xffffffffc09a7000"
    .note.gnu.build-id  = "0xffffffffc0774000"
    .parainstructions   = "0xffffffffc07aec00"
    .rodata             = "0xffffffffc0776800"
    .rodata.str1.1      = "0xffffffffc078de1c"
    .rodata.str1.8      = "0xffffffffc0792cb0"
    .smp_locks          = "0xffffffffc078d684"
    .strtab             = "0xffffffffc09e1178"
    .symtab             = "0xffffffffc09a9000"
    .text               = "0xffffffffc05f9000"
    .text.unlikely      = "0xffffffffc07731d8"
    __bug_table         = "0xffffffffc07d1da4"
    __jump_table        = "0xffffffffc07c0000"
    __ksymtab           = "0xffffffffc0774030"
    __ksymtab_strings   = "0xffffffffc07bd148"
    __mcount_loc        = "0xffffffffc07b4a28"
    __param             = "0xffffffffc07b2610"


root@zfs:~# cat /proc/spl/kstat/zfs/arcstats
12 1 0x01 98 26656 14093184634 1224910481851349
name                            type data
hits                            4    704041195
misses                          4    2744949
demand_data_hits                4    260501
demand_data_misses              4    476953
demand_metadata_hits            4    703209429
demand_metadata_misses          4    969970
prefetch_data_hits              4    11386
prefetch_data_misses            4    239621
prefetch_metadata_hits          4    559879
prefetch_metadata_misses        4    1058405
mru_hits                        4    13617271
mru_ghost_hits                  4    17737
mfu_hits                        4    690001738
mfu_ghost_hits                  4    1765
deleted                         4    3101597
mutex_miss                      4    1677
access_skip                     4    5
evict_skip                      4    39643211
evict_not_enough                4    7411892
evict_l2_cached                 4    0
evict_l2_eligible               4    21536072704
evict_l2_ineligible             4    40198401024
evict_l2_skip                   4    0
hash_elements                   4    13708
hash_elements_max               4    432738
hash_collisions                 4    364283
hash_chains                     4    94
hash_chain_max                  4    5
p                               4    689080832
c                               4    1073741824
c_min                           4    252525184
c_max                           4    1073741824
size                            4    1222968480
compressed_size                 4    295622656
uncompressed_size               4    332941312
overhead_size                   4    273108480
hdr_size                        4    6912992
data_size                       4    33457664
metadata_size                   4    535273472
dbuf_size                       4    133838016
dnode_size                      4    380898624
bonus_size                      4    132587712
anon_size                       4    50364416
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    310185984
mru_evictable_data              4    0
mru_evictable_metadata          4    0
mru_ghost_size                  4    0
mru_ghost_evictable_data        4    0
mru_ghost_evictable_metadata    4    0
mfu_size                        4    208180736
mfu_evictable_data              4    0
mfu_evictable_metadata          4    0
mfu_ghost_size                  4    0
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    0
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
memory_throttle_count           4    0
memory_direct_count             4    0
memory_indirect_count           4    0
memory_all_bytes                4    8080805888
memory_free_bytes               4    3879395328
memory_available_bytes          3    3753136128
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    15495680
arc_prune                       4    11107355
arc_meta_used                   4    1189510816
arc_meta_limit                  4    805306368
arc_dnode_limit                 4    80530636
arc_meta_max                    4    1189630648
arc_meta_min                    4    16777216
async_upgrade_sync              4    1796
demand_hit_predictive_prefetch  4    77527
demand_hit_prescient_prefetch   4    1448691
arc_need_free                   4    0
arc_sys_free                    4    126262592
arc_raw_size                    4    195579904


root@zfs:~# arc_summary

------------------------------------------------------------------------
ZFS Subsystem Report                            Wed Feb 12 16:33:48 2020
Linux 4.15.0-76-generic                         0.8.3-1~rlaager1.18.04.1
Machine: zfs (x86_64)                       0.8.3-1~rlaager1.18.04.1

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                   114.5 %    1.1 GiB
        Target size (adaptive):                       100.0 %    1.0 GiB
        Min size (hard limit):                         23.5 %  240.8 MiB
        Max size (high water):                            4:1    1.0 GiB
        Most Frequently Used (MFU) cache size:         39.7 %  199.4 MiB
        Most Recently Used (MRU) cache size:           60.3 %  302.6 MiB
        Metadata cache size (hard limit):              75.0 %  768.0 MiB
        Metadata cache size (current):                148.5 %    1.1 GiB
        Dnode cache size (hard limit):                 10.0 %   76.8 MiB
        Dnode cache size (current):                   474.1 %  364.1 MiB

ARC hash breakdown:
        Elements max:                                             432.7k
        Elements current:                               3.2 %      13.9k
        Collisions:                                               364.3k
        Chain max:                                                     5
        Chains:                                                       98

ARC misc:
        Deleted:                                                    3.1M
        Mutex misses:                                               1.7k
        Eviction skips:                                            40.1M

ARC total accesses (hits + misses):                               706.8M
        Cache hit ratio:                               99.6 %     704.0M
        Cache miss ratio:                               0.4 %       2.7M
        Actual hit ratio (MFU + MRU hits):             99.6 %     703.6M
        Data demand efficiency:                        35.3 %     737.5k
        Data prefetch efficiency:                       4.5 %     251.0k

Cache hits by cache type:
        Most frequently used (MFU):                    98.0 %     690.0M
        Most recently used (MRU):                       1.9 %      13.6M
        Most frequently used (MFU) ghost:             < 0.1 %       1.8k
        Most recently used (MRU) ghost:               < 0.1 %      17.7k
        Anonymously used:                               0.1 %     402.7k

Cache hits by data type:
        Demand data:                                  < 0.1 %     260.5k
        Demand prefetch data:                         < 0.1 %      11.4k
        Demand metadata:                               99.9 %     703.2M
        Demand prefetch metadata:                       0.1 %     559.9k

Cache misses by data type:
        Demand data:                                   17.4 %     477.0k
        Demand prefetch data:                           8.7 %     239.6k
        Demand metadata:                               35.3 %     970.0k
        Demand prefetch metadata:                      38.6 %       1.1M

DMU prefetch efficiency:                                          439.6M
        Hit ratio:                                      0.1 %     235.6k
        Miss ratio:                                    99.9 %     439.3M

L2ARC not detected, skipping section

Solaris Porting Layer (SPL):
        spl_hostid                                                     0
        spl_hostid_path                                      /etc/hostid
        spl_kmem_alloc_max                                       1048576
        spl_kmem_alloc_warn                                        65536
        spl_kmem_cache_expire                                          2
        spl_kmem_cache_kmem_limit                                   2048
        spl_kmem_cache_kmem_threads                                    4
        spl_kmem_cache_magazine_size                                   0
        spl_kmem_cache_max_size                                       32
        spl_kmem_cache_obj_per_slab                                    8
        spl_kmem_cache_obj_per_slab_min                                1
        spl_kmem_cache_reclaim                                         0
        spl_kmem_cache_slab_limit                                  16384
        spl_max_show_tasks                                           512
        spl_panic_halt                                                 0
        spl_schedule_hrtimeout_slack_us                                0
        spl_taskq_kick                                                 0
        spl_taskq_thread_bind                                          0
        spl_taskq_thread_dynamic                                       1
        spl_taskq_thread_priority                                      1
        spl_taskq_thread_sequential                                    4

Tunables:
        dbuf_cache_hiwater_pct                                        10
        dbuf_cache_lowater_pct                                        10
        dbuf_cache_max_bytes                                    33554432
        dbuf_cache_shift                                               5
        dbuf_metadata_cache_max_bytes                           16777216
        dbuf_metadata_cache_shift                                      6
        dmu_object_alloc_chunk_shift                                   7
        dmu_prefetch_max                                       134217728
        ignore_hole_birth                                              1
        l2arc_feed_again                                               1
        l2arc_feed_min_ms                                            200
        l2arc_feed_secs                                                1
        l2arc_headroom                                                 2
        l2arc_headroom_boost                                         200
        l2arc_noprefetch                                               1
        l2arc_norw                                                     0
        l2arc_write_boost                                        8388608
        l2arc_write_max                                          8388608
        metaslab_aliquot                                          524288
        metaslab_bias_enabled                                          1
        metaslab_debug_load                                            0
        metaslab_debug_unload                                          0
        metaslab_df_max_search                                  16777216
        metaslab_df_use_largest_segment                                0
        metaslab_force_ganging                                  16777217
        metaslab_fragmentation_factor_enabled                          1
        metaslab_lba_weighting_enabled                                 1
        metaslab_preload_enabled                                       1
        send_holes_without_birth_time                                  1
        spa_asize_inflation                                           24
        spa_config_path                             /etc/zfs/zpool.cache
        spa_load_print_vdev_tree                                       0
        spa_load_verify_data                                           1
        spa_load_verify_metadata                                       1
        spa_load_verify_shift                                          4
        spa_slop_shift                                                 5
        vdev_removal_max_span                                      32768
        vdev_validate_skip                                             0
        zap_iterate_prefetch                                           1
        zfetch_array_rd_sz                                       1048576
        zfetch_max_distance                                      8388608
        zfetch_max_streams                                             8
        zfetch_min_sec_reap                                            2
        zfs_abd_scatter_enabled                                        1
        zfs_abd_scatter_max_order                                     10
        zfs_abd_scatter_min_size                                    1536
        zfs_admin_snapshot                                             0
        zfs_arc_average_blocksize                                   8192
        zfs_arc_dnode_limit                                            0
        zfs_arc_dnode_limit_percent                                   10
        zfs_arc_dnode_reduce_percent                                  10
        zfs_arc_grow_retry                                             0
        zfs_arc_lotsfree_percent                                      10
        zfs_arc_max                                           1073741824
        zfs_arc_meta_adjust_restarts                                4096
        zfs_arc_meta_limit                                             0
        zfs_arc_meta_limit_percent                                    75
        zfs_arc_meta_min                                               0
        zfs_arc_meta_prune                                         10000
        zfs_arc_meta_strategy                                          1
        zfs_arc_min                                                    0
        zfs_arc_min_prefetch_ms                                        0
        zfs_arc_min_prescient_prefetch_ms                              0
        zfs_arc_p_dampener_disable                                     1
        zfs_arc_p_min_shift                                            0
        zfs_arc_pc_percent                                             0
        zfs_arc_shrink_shift                                           0
        zfs_arc_sys_free                                               0
        zfs_async_block_max_blocks                                100000
        zfs_autoimport_disable                                         1
        zfs_checksum_events_per_second                                20
        zfs_commit_timeout_pct                                         5
        zfs_compressed_arc_enabled                                     1
        zfs_condense_indirect_commit_entry_delay_ms                    0
        zfs_condense_indirect_vdevs_enable                             1
        zfs_condense_max_obsolete_bytes                       1073741824
        zfs_condense_min_mapping_bytes                            131072
        zfs_dbgmsg_enable                                              1
        zfs_dbgmsg_maxsize                                       4194304
        zfs_dbuf_state_index                                           0
        zfs_ddt_data_is_special                                        1
        zfs_deadman_checktime_ms                                   60000
        zfs_deadman_enabled                                            1
        zfs_deadman_failmode                                        wait
        zfs_deadman_synctime_ms                                   600000
        zfs_deadman_ziotime_ms                                    300000
        zfs_dedup_prefetch                                             0
        zfs_delay_min_dirty_percent                                   60
        zfs_delay_scale                                           500000
        zfs_delete_blocks                                          20480
        zfs_dirty_data_max                                     808080588
        zfs_dirty_data_max_max                                2020201472
        zfs_dirty_data_max_max_percent                                25
        zfs_dirty_data_max_percent                                    10
        zfs_dirty_data_sync_percent                                   20
        zfs_disable_ivset_guid_check                                   0
        zfs_dmu_offset_next_sync                                       0
        zfs_expire_snapshot                                          300
        zfs_flags                                                      0
        zfs_free_bpobj_enabled                                         1
        zfs_free_leak_on_eio                                           0
        zfs_free_min_time_ms                                        1000
        zfs_immediate_write_sz                                     32768
        zfs_initialize_value                        16045690984833335022
        zfs_key_max_salt_uses                                  400000000
        zfs_lua_max_instrlimit                                 100000000
        zfs_lua_max_memlimit                                   104857600
        zfs_max_missing_tvds                                           0
        zfs_max_recordsize                                       1048576
        zfs_metaslab_fragmentation_threshold                          70
        zfs_metaslab_segment_weight_enabled                            1
        zfs_metaslab_switch_threshold                                  2
        zfs_mg_fragmentation_threshold                                95
        zfs_mg_noalloc_threshold                                       0
        zfs_multihost_fail_intervals                                  10
        zfs_multihost_history                                          0
        zfs_multihost_import_intervals                                20
        zfs_multihost_interval                                      1000
        zfs_multilist_num_sublists                                     0
        zfs_no_scrub_io                                                0
        zfs_no_scrub_prefetch                                          0
        zfs_nocacheflush                                               0
        zfs_nopwrite_enabled                                           1
        zfs_object_mutex_size                                         64
        zfs_obsolete_min_time_ms                                     500
        zfs_override_estimate_recordsize                               0
        zfs_pd_bytes_max                                        52428800
        zfs_per_txg_dirty_frees_percent                                5
        zfs_prefetch_disable                                           0
        zfs_read_chunk_size                                      1048576
        zfs_read_history                                               0
        zfs_read_history_hits                                          0
        zfs_reconstruct_indirect_combinations_max                   4096
        zfs_recover                                                    0
        zfs_recv_queue_length                                   16777216
        zfs_removal_ignore_errors                                      0
        zfs_removal_suspend_progress                                   0
        zfs_remove_max_segment                                  16777216
        zfs_resilver_disable_defer                                     0
        zfs_resilver_min_time_ms                                    3000
        zfs_scan_checkpoint_intval                                  7200
        zfs_scan_fill_weight                                           3
        zfs_scan_ignore_errors                                         0
        zfs_scan_issue_strategy                                        0
        zfs_scan_legacy                                                0
        zfs_scan_max_ext_gap                                     2097152
        zfs_scan_mem_lim_fact                                         20
        zfs_scan_mem_lim_soft_fact                                    20
        zfs_scan_strict_mem_lim                                        0
        zfs_scan_suspend_progress                                      0
        zfs_scan_vdev_limit                                      4194304
        zfs_scrub_min_time_ms                                       1000
        zfs_send_corrupt_data                                          0
        zfs_send_queue_length                                   16777216
        zfs_send_unmodified_spill_blocks                               1
        zfs_slow_io_events_per_second                                 20
        zfs_spa_discard_memory_limit                            16777216
        zfs_special_class_metadata_reserve_pct                        25
        zfs_sync_pass_deferred_free                                    2
        zfs_sync_pass_dont_compress                                    8
        zfs_sync_pass_rewrite                                          2
        zfs_sync_taskq_batch_pct                                      75
        zfs_trim_extent_bytes_max                              134217728
        zfs_trim_extent_bytes_min                                  32768
        zfs_trim_metaslab_skip                                         0
        zfs_trim_queue_limit                                          10
        zfs_trim_txg_batch                                            32
        zfs_txg_history                                              100
        zfs_txg_timeout                                                5
        zfs_unlink_suspend_progress                                    0
        zfs_user_indirect_is_special                                   1
        zfs_vdev_aggregate_trim                                        0
        zfs_vdev_aggregation_limit                               1048576
        zfs_vdev_aggregation_limit_non_rotating                   131072
        zfs_vdev_async_read_max_active                                 3
        zfs_vdev_async_read_min_active                                 1
        zfs_vdev_async_write_active_max_dirty_percent                 60
        zfs_vdev_async_write_active_min_dirty_percent                 30
        zfs_vdev_async_write_max_active                               10
        zfs_vdev_async_write_min_active                                2
        zfs_vdev_cache_bshift                                         16
        zfs_vdev_cache_max                                         16384
        zfs_vdev_cache_size                                            0
        zfs_vdev_default_ms_count                                    200
        zfs_vdev_initializing_max_active                               1
        zfs_vdev_initializing_min_active                               1
        zfs_vdev_max_active                                         1000
        zfs_vdev_min_ms_count                                         16
        zfs_vdev_mirror_non_rotating_inc                               0
        zfs_vdev_mirror_non_rotating_seek_inc                          1
        zfs_vdev_mirror_rotating_inc                                   0
        zfs_vdev_mirror_rotating_seek_inc                              5
        zfs_vdev_mirror_rotating_seek_offset                     1048576
        zfs_vdev_ms_count_limit                                   131072
        zfs_vdev_queue_depth_pct                                    1000
        zfs_vdev_raidz_impl   cycle [fastest] original scalar sse2 ssse3
        zfs_vdev_read_gap_limit                                    32768
        zfs_vdev_removal_max_active                                    2
        zfs_vdev_removal_min_active                                    1
        zfs_vdev_scheduler                                        unused
        zfs_vdev_scrub_max_active                                      2
        zfs_vdev_scrub_min_active                                      1
        zfs_vdev_sync_read_max_active                                 10
        zfs_vdev_sync_read_min_active                                 10
        zfs_vdev_sync_write_max_active                                10
        zfs_vdev_sync_write_min_active                                10
        zfs_vdev_trim_max_active                                       2
        zfs_vdev_trim_min_active                                       1
        zfs_vdev_write_gap_limit                                    4096
        zfs_zevent_cols                                               80
        zfs_zevent_console                                             0
        zfs_zevent_len_max                                           512
        zfs_zil_clean_taskq_maxalloc                             1048576
        zfs_zil_clean_taskq_minalloc                                1024
        zfs_zil_clean_taskq_nthr_pct                                 100
        zil_maxblocksize                                          131072
        zil_nocacheflush                                               0
        zil_replay_disable                                             0
        zil_slog_bulk                                             786432
        zio_deadman_log_all                                            0
        zio_dva_throttle_enabled                                       1
        zio_requeue_io_start_cut_in_line                               1
        zio_slow_io_ms                                             30000
        zio_taskq_batch_pct                                           75
        zvol_inhibit_dev                                               0
        zvol_major                                                   230
        zvol_max_discard_blocks                                    16384
        zvol_prefetch_bytes                                       131072
        zvol_request_sync                                              0
        zvol_threads                                                  32
        zvol_volmode                                                   1

VDEV cache disabled, skipping section

ZIL committed transactions:                                           16
        Commit requests:                                              62
        Flushes to stable storage:                                    62
        Transactions to SLOG storage pool:            0 Bytes          0
        Transactions to non-SLOG storage pool:        3.7 KiB          4


grep . /sys/module/zfs/parameters/zfs_arc_meta*
/sys/module/zfs/parameters/zfs_arc_meta_adjust_restarts:4096
/sys/module/zfs/parameters/zfs_arc_meta_limit:0
/sys/module/zfs/parameters/zfs_arc_meta_limit_percent:75
/sys/module/zfs/parameters/zfs_arc_meta_min:0
/sys/module/zfs/parameters/zfs_arc_meta_prune:10000
/sys/module/zfs/parameters/zfs_arc_meta_strategy:1

mailinglists35 avatar Feb 12 '20 14:02 mailinglists35

The workaround had only a temporary effect; arc_prune is spinning again.

Update: I had to increase zfs_arc_max to stop the spinning.
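
The increase itself is just a runtime write to the module parameter; the 4 GiB value here is only an illustrative example:

echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max   # 4 GiB, example value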

mailinglists35 avatar Feb 12 '20 15:02 mailinglists35

Eventually it started spinning again, despite the increased zfs_arc_max.

mailinglists35 avatar Feb 12 '20 15:02 mailinglists35

Hasn't happened here -- knock on wood -- but I'm done copying lots of files. Mine's got 64G RAM and a 160G cache SSD; ARC size is left at the default. It's a 2x8 Opteron, and it didn't lock up until it had arc_prune spinning on every core. (I don't have a single one in top now.)

dmaziuk avatar Feb 12 '20 15:02 dmaziuk

I saw this arc_prune behavior last year on one server with 32GB of ECC RAM. I had to disable the weekly scrubs because of it; I think there is another thread where I commented on that. Since I have just moved to a new building and the network topology is completely different, I can't test it now.

drescherjm avatar Feb 12 '20 16:02 drescherjm

I'm seeing the issue again after a reboot, on a simple zfs umount of the newly created pool.

mailinglists35 avatar Feb 12 '20 16:02 mailinglists35

I thought I'd add a note that I have also seen this on Debian buster with the Debian backports version on amd64. It did clear itself out after some time (hours, if I remember right). If I see it again I'll try to get more information.

A couple of data points worth noting: there is no cache device, the machine only receives snapshots from a different system, there is little if any user I/O, and no network shares are exported from it.

schallee avatar Feb 12 '20 20:02 schallee

I'm seeing a similar problem (arc_reclaim and arc_prune chewing up CPU). Bumping zfs_arc_meta_limit_percent only provided a temporary reprieve while arc_meta filled up; bumping it all the way to 100% doesn't fix the issue.

Why is there so much metadata in the ARC in the first place? If relevant, the machine in question is a dedicated backup machine, currently doing nothing other than being the receiving end of a lot of zfs send -i "backups".
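
A quick way to break the ARC metadata usage down by bucket, using the size counters already present in arcstats:

grep -E '^(hdr|metadata|dbuf|dnode|bonus|l2_hdr)_size ' /proc/spl/kstat/zfs/arcstats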

System info:

russ@pth-hs1:~$ cat /sys/module/zfs/version
0.7.5-1ubuntu16.6
russ@pth-hs1:~$ uname -a
Linux pth-hs1 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
russ@pth-hs1:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.4 LTS
Release:        18.04
Codename:       bionic

Here is the current state of the ARC (allowing 100% arc_meta, but still having prune issues):

russ@pth-hs1:~$ grep dnode /proc/spl/kstat/zfs/arcstats
dnode_size                      4    8716864
arc_dnode_limit                 4    1684769382
russ@pth-hs1:~$ egrep "(arc_meta_used|arc_meta_limit)" /proc/spl/kstat/zfs/arcstats                        
arc_meta_used                   4    16847797336
arc_meta_limit                  4    16847693824
russ@pth-hs1:~$ grep -H . /sys/module/zfs/parameters/zfs_arc_{meta_limit{,_percent},dnode_limit{,_percent}}
/sys/module/zfs/parameters/zfs_arc_meta_limit:0
/sys/module/zfs/parameters/zfs_arc_meta_limit_percent:100
/sys/module/zfs/parameters/zfs_arc_dnode_limit:0
/sys/module/zfs/parameters/zfs_arc_dnode_limit_percent:10
russ@pth-hs1:~$ cat /proc/spl/kstat/zfs/arcstats
13 1 0x01 96 4608 41705217416 3820892427321077
name                            type data
hits                            4    7413219043
misses                          4    5467855407
demand_data_hits                4    0
demand_data_misses              4    0
demand_metadata_hits            4    7362177240
demand_metadata_misses          4    5453086223
prefetch_data_hits              4    0
prefetch_data_misses            4    0
prefetch_metadata_hits          4    51041803
prefetch_metadata_misses        4    14769184
mru_hits                        4    165815182
mru_ghost_hits                  4    1446761
mfu_hits                        4    7216088077
mfu_ghost_hits                  4    4321466
deleted                         4    112910070
mutex_miss                      4    40132
access_skip                     4    3387
evict_skip                      4    15984980671
evict_not_enough                4    22139723
evict_l2_cached                 4    0
evict_l2_eligible               4    948597889024
evict_l2_ineligible             4    1871813890048
evict_l2_skip                   4    0
hash_elements                   4    428907
hash_elements_max               4    2104109
hash_collisions                 4    43345413
hash_chains                     4    20493
hash_chain_max                  4    7
p                               4    16200329216
c                               4    16847693824
c_min                           4    1052980864
c_max                           4    16847693824
size                            4    16847639536
compressed_size                 4    16577940480
uncompressed_size               4    53259091456
overhead_size                   4    114052608
hdr_size                        4    141165920
data_size                       4    0
metadata_size                   4    16691993088
dbuf_size                       4    3742544
dnode_size                      4    8716864
bonus_size                      4    2021120
anon_size                       4    296960
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    31092736
mru_evictable_data              4    0
mru_evictable_metadata          4    27729920
mru_ghost_size                  4    66322432
mru_ghost_evictable_data        4    0
mru_ghost_evictable_metadata    4    66322432
mfu_size                        4    16660603392
mfu_evictable_data              4    0
mfu_evictable_metadata          4    16499196928
mfu_ghost_size                  4    21757952
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    21757952
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
memory_throttle_count           4    0
memory_direct_count             4    0
memory_indirect_count           4    0
memory_all_bytes                4    33695387648
memory_free_bytes               4    13351628800
memory_available_bytes          3    12825141248
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    67235209
arc_meta_used                   4    16847639536
arc_meta_limit                  4    16847693824
arc_dnode_limit                 4    1684769382
arc_meta_max                    4    16913555592
arc_meta_min                    4    16777216
sync_wait_for_async             4    57
demand_hit_predictive_prefetch  4    2989
arc_need_free                   4    0
arc_sys_free                    4    526490432

rwarren avatar Mar 08 '20 09:03 rwarren

arc_prune settled down after a few hours and CPU usage is now back to normal levels. Given the timing, I'm unclear whether bumping zfs_arc_meta_limit_percent to 100% was the difference maker, or whether I just needed to wait a few more hours.

Below is a crude view of the incident from a CPU load perspective across all CPUs, where the middle marker is when I was bumping zfs_arc_meta_limit_percent up, ultimately leaving it at 100%:

[image: per-CPU load graph for the duration of the incident]

I don't know enough about ARC internals to know why metadata would be such a significant fraction, nor whether it is a long-term problem to move zfs_arc_meta_limit_percent from its default of 75% to 100%.
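For reference, a minimal sketch of how the tunable can be adjusted, assuming the usual ZFS-on-Linux module parameter handling (paths and file names may differ per distribution):

# change it at runtime (takes effect immediately):
echo 100 > /sys/module/zfs/parameters/zfs_arc_meta_limit_percent
# make it persistent across reboots and module reloads:
echo "options zfs zfs_arc_meta_limit_percent=100" >> /etc/modprobe.d/zfs.conf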

rwarren avatar Mar 08 '20 15:03 rwarren

maybe related/duplicate: https://github.com/openzfs/zfs/issues/10119

devZer0 avatar Mar 11 '20 09:03 devZer0

For what it's worth, whenever I run a workload on my system that I know will peg the arc_prune threads, I run this in another terminal:

while true ; do zpool iostat 10 12 ; echo 2 > /proc/sys/vm/drop_caches ; done

It's blunt and brutal, but it stops those arc_prune threads from spending 50% of the CPUs trying to acquire spinlocks. It might save others from reboots, anyway.
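A slightly gentler variant is sketched below (untested here; it assumes the arc_prune counter in /proc/spl/kstat/zfs/arcstats behaves as in the outputs above). It only drops the dentry/inode caches while that counter is still climbing, rather than unconditionally every cycle:

#!/bin/sh
# Sketch: drop dentry/inode caches only while the arc_prune kstat keeps growing.
prev=0
while true; do
    cur=$(awk '$1=="arc_prune" {print $3}' /proc/spl/kstat/zfs/arcstats)
    if [ "$cur" -gt "$prev" ]; then
        echo 2 > /proc/sys/vm/drop_caches    # 2 = reclaim dentries and inodes only
    fi
    prev=$cur
    sleep 300
done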

Thanks

setharnold avatar Apr 17 '20 23:04 setharnold

Not for me; I already tried that, and eventually the system still hangs.

mailinglists35 avatar Apr 24 '20 23:04 mailinglists35

might be related to https://github.com/openzfs/zfs/issues/7559

alek-p avatar May 14 '20 20:05 alek-p

We recently experienced the very same issue: arc_prune and arc_reclaim consume a lot of CPU across all cores.

We see this issue on only a single machine. The biggest difference is that this is an AMD EPYC with a lot of cores (AMD EPYC 7452 32-Core Processor). It's also worth mentioning that this is an SSD-only pool with no log or cache device:

  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 214h25m with 0 errors on Mon Aug 10 20:25:45 2020
config:

	NAME                                            STATE     READ WRITE CKSUM
	tank                                            ONLINE       0     0     0
	  mirror-0                                      ONLINE       0     0     0
	    ata-INTEL_SSDSC2KB038T8_PHYF947405A33P8EGN  ONLINE       0     0     0
	    ata-INTEL_SSDSC2KB038T8_PHYF947403PT3P8EGN  ONLINE       0     0     0

errors: No known data errors

What is also very interesting is that, according to the monitoring system, it only started happening a couple of weeks ago; the machine was working properly for weeks before that.

I also tried to increase arc_meta_limit with no success:

# cat /proc/spl/kstat/zfs/arcstats|grep arc_meta
arc_meta_used                   4    21485281368
arc_meta_limit                  4    16106127360
arc_meta_max                    4    21558749936
arc_meta_min                    4    16777216
# echo 21558749936 > /sys/module/zfs/parameters/zfs_arc_meta_limit
# cat /proc/spl/kstat/zfs/arcstats|grep arc_meta
arc_meta_used                   4    21530789416
arc_meta_limit                  4    16106127360
arc_meta_max                    4    21558749936
arc_meta_min                    4    16777216

BTW, echo 3 > /proc/sys/vm/drop_caches also heals the machine, but only for a short time.

Why is that?

In addition, it's unclear to me whether zfs_arc_meta_limit_percent=100 will actually fix it, or whether that too is just a temporary workaround.
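One guess, and it is only an assumption on my part: if I remember the tunable-update logic correctly, a value written to zfs_arc_meta_limit is ignored when it is larger than the ARC's c_max, which would explain why the echo above had no visible effect. A quick sketch to check that, using only the arcstats file:

# compare the requested meta limit with the ARC ceiling (c_max);
# if the echoed value exceeds c_max, using zfs_arc_meta_limit_percent instead
# avoids the arithmetic entirely.
awk '$1=="c_max" || $1=="arc_meta_limit" || $1=="arc_meta_used" {printf "%-16s %s\n", $1, $3}' /proc/spl/kstat/zfs/arcstats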

tomposmiko avatar Aug 13 '20 09:08 tomposmiko

@behlendorf sorry to ping you, but I read all the referenced issues without fully understanding some important things:

  • why does arc_meta grow so much?
  • why can't arc_prune actually reclaim sufficient memory (i.e., why does it make no forward progress)?
  • what are pinned buffers, and why can't they be reclaimed by arc_reclaim?
  • what kind of workload can trigger the described issue?

I have yet to see a 100% arc_prune and/or arc_reclaim thread on my systems, but I would like to be prepared for the worst...

Thanks.

shodanshok avatar Aug 30 '20 12:08 shodanshok

On Sun, Aug 30, 2020 at 05:01:23AM -0700, shodanshok wrote:

  • what kind of workload can trigger the described issue?

In my case, I've got roughly one terabyte of unpacked source code on a system with only 128 gigabytes of memory and I want to use ripgrep to search through all the files.

I realize this is a silly workload.

Because it runs faster with a loop like this running at the same time, I have some hope that the situation can be improved with the clever application of programming:

while true ; do sleep 300 ; echo 2 > /proc/sys/vm/drop_caches ; done

Thanks

setharnold avatar Sep 01 '20 01:09 setharnold

Possibly related:

  • #10331 - fix dnode eviction typo in arc_evict_state()
  • #10508 - Data throughput causing apparent (directory) metadata eviction with metadata_size << arc_meta_min
  • #10563 - dbuf cache size is 1/32nd what was intended
  • #10600 - Revise ARC shrinker algorithm
  • #10610 - Limit dbuf cache sizes based only on ARC target size by default
  • #10618 - Restore ARC MFU/MRU pressure

chrisrd avatar Sep 03 '20 05:09 chrisrd

I have a PowerEdge R730 that has been experiencing this problem since migrating to Ubuntu 20.04. The problem didn't occur with the very same pool under CentOS 8. It's a Veeam Backup Repository, and the problem occurs daily, causing connections to time out and the respective backups to fail. My current solution is daily reboots.

This data was gathered after today's reboot. I'll gather again tomorrow while the problem exists.

$ cat /sys/module/zfs/version
0.8.3-1ubuntu12.4
$ grep -H . /sys/module/zfs/parameters/zfs_arc_{meta_limit{,_percent},dnode_limit{,_percent}}
/sys/module/zfs/parameters/zfs_arc_meta_limit:0
/sys/module/zfs/parameters/zfs_arc_meta_limit_percent:75
/sys/module/zfs/parameters/zfs_arc_dnode_limit:0
/sys/module/zfs/parameters/zfs_arc_dnode_limit_percent:10
$ grep dnode /proc/spl/kstat/zfs/arcstats
dnode_size                      4    14965008
arc_dnode_limit                 4    2529106329

Attachments: zpool status.txt, arcstats.txt, arc_summary.txt, lshw.zip

4fd81048-Brian avatar Oct 12 '20 12:10 4fd81048-Brian