100% CPU load from arc_prune
System information
Type | Version/Name |
---|---|
Distribution Name | CentOS |
Distribution Version | CentOS release 6.10 (Final) |
Linux Kernel | Linux hereland.nmrfam.wisc.edu 2.6.32-754.25.1.el6.x86_64 #1 SMP Mon Dec 23 15:19:53 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
Architecture | AMD64 |
ZFS Version | 0.8.3-1 |
SPL Version | 0.8.3-1 |
Describe the problem you're observing
Moving 14 TB of external data into a ZFS dataset; after a couple of days the CPU is at 99% system load. Clients time out trying to mount the dataset (exported via sharenfs).
Describe how to reproduce the problem
This is the first occurrence for me; I have no idea whether it's reproducible.
Include any warning/errors/backtraces from the system logs
top - 09:49:18 up 3 days, 21:38, 3 users, load average: 48.27, 47.56, 47.73
Tasks: 653 total, 31 running, 622 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 99.0%sy, 0.0%ni, 1.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32993392k total, 28381572k used, 4611820k free, 66360k buffers
Swap: 65535996k total, 57400k used, 65478596k free, 1500656k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20905 root 20 0 0 0 0 R 98.7 0.0 167:51.80 arc_prune
7821 root 20 0 0 0 0 R 98.4 0.0 212:08.44 arc_prune
8292 root 20 0 0 0 0 R 98.4 0.0 70:27.30 arc_prune
20589 root 20 0 0 0 0 R 98.4 0.0 168:11.97 arc_prune
22506 root 20 0 0 0 0 R 98.4 0.0 166:39.25 arc_prune
1772 root 20 0 0 0 0 R 98.1 0.0 475:47.11 arc_prune
8502 root 20 0 0 0 0 R 98.1 0.0 212:09.17 arc_prune
18799 root 20 0 0 0 0 R 98.1 0.0 169:55.35 arc_prune
19397 root 20 0 0 0 0 R 98.1 0.0 169:08.63 arc_prune
20312 root 20 0 0 0 0 R 98.1 0.0 168:14.17 arc_prune
23623 root 20 0 0 0 0 R 98.1 0.0 164:48.99 arc_prune
28861 root 20 0 0 0 0 R 98.1 0.0 137:44.49 arc_prune
21294 root 20 0 0 0 0 R 97.8 0.0 167:55.44 arc_prune
19192 root 20 0 0 0 0 R 96.5 0.0 4:46.02 arc_prune
23087 root 20 0 0 0 0 R 95.8 0.0 165:28.73 arc_prune
19802 root 20 0 0 0 0 R 94.9 0.0 168:58.81 arc_prune
1773 root 39 19 0 0 0 R 6.1 0.0 84:38.77 zthr_procedure
19929 root 20 0 15432 1724 944 R 1.9 0.0 0:00.50 top
More info:
# zpool status
pool: backups
state: ONLINE
scan: scrub repaired 0B in 0 days 00:10:52 with 0 errors on Sat Feb 1 00:10:54 2020
config:
NAME STATE READ WRITE CKSUM
backups ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
scsi-35000c500860d0c4b ONLINE 0 0 0
scsi-35000c500860d1543 ONLINE 0 0 0
scsi-35000c500860f9437 ONLINE 0 0 0
errors: No known data errors
pool: tank
state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(5) for details.
scan: scrub repaired 0B in 0 days 21:22:47 with 0 errors on Sat Feb 1 21:22:49 2020
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
scsi-35000c500860ba403 ONLINE 0 0 0
scsi-35000c500860bbb4f ONLINE 0 0 0
scsi-35000c500860bbb5f ONLINE 0 0 0
raidz1-2 ONLINE 0 0 0
scsi-35000c500860bcb77 ONLINE 0 0 0
scsi-35000c500860cf9d7 ONLINE 0 0 0
scsi-35000c500860d0b67 ONLINE 0 0 0
raidz1-3 ONLINE 0 0 0
scsi-35000c500860f96db ONLINE 0 0 0
scsi-35000c500860fc85f ONLINE 0 0 0
scsi-35000c500860fd93b ONLINE 0 0 0
logs
mirror-1 ONLINE 0 0 0
scsi-35000c500301bf47f-part1 ONLINE 0 0 0
scsi-35000c500301bf69f-part1 ONLINE 0 0 0
cache
scsi-35000c500301bf6c3-part1 ONLINE 0 0 0
errors: No known data errors
# zpool iostat
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
backups 402G 21.4T 1 0 6.71K 204K
tank 31.9T 33.5T 783 217 33.8M 26.7M
---------- ----- ----- ----- ----- ----- -----
zfs_arc_max is not set; it's left at the default. The cache SSD is 160 GB. All disks are dual-ported SAS drives in an external shelf.
Even more info:
...
Feb 7 04:55:40 hereland zed[6190]: Invoking "all-syslog.sh" eid=180 pid=16962
Feb 7 04:55:40 hereland zed: eid=180 class=history_event pool_guid=0x68C093597CA7900F
Feb 7 04:55:40 hereland zed[6190]: Finished "all-syslog.sh" eid=180 pid=16962 exit=0
Feb 7 04:55:41 hereland zed[6190]: Invoking "all-syslog.sh" eid=181 pid=17099
Feb 7 04:55:41 hereland zed: eid=181 class=history_event pool_guid=0x68C093597CA7900F
Feb 7 04:55:41 hereland zed[6190]: Finished "all-syslog.sh" eid=181 pid=17099 exit=0
Feb 7 06:14:42 hereland kernel: INFO: task nfsd:6109 blocked for more than 120 seconds.
...
There's nothing out of the ordinary in syslog until the nfsd "hung task" message.
Does it look like I'm getting hit by the "don't make the cache disk bigger than 5 x 1/2 RAM" problem? I thought that was no longer a thing.
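For reference, the cost that rule of thumb is about is the RAM consumed by L2ARC headers; a quick way to see how much that actually is on a given system (just a sketch) is:
grep -E '^(l2_hdr_size|l2_size|arc_meta_used)' /proc/spl/kstat/zfs/arcstats
l2_hdr_size is the ARC memory used by headers for buffers that live on the cache device.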
@dmaziuk that's a possibility. Can you post the contents of the /proc/spl/kstat/zfs/arcstats file? It should give us a better idea of why the system is so aggressively trying to reclaim from the ARC.
# cat /proc/spl/kstat/zfs/arcstats
12 1 0x01 98 26656 14203091063 342692306280336
name type data
hits 4 1961861302
misses 4 214595515
demand_data_hits 4 535849124
demand_data_misses 4 21425564
demand_metadata_hits 4 1310503883
demand_metadata_misses 4 29241451
prefetch_data_hits 4 1327
prefetch_data_misses 4 146730344
prefetch_metadata_hits 4 115506968
prefetch_metadata_misses 4 17198156
mru_hits 4 760224963
mru_ghost_hits 4 1685841
mfu_hits 4 1094334124
mfu_ghost_hits 4 1919548
deleted 4 294724177
mutex_miss 4 24519247
access_skip 4 701
evict_skip 4 554617006
evict_not_enough 4 75691673
evict_l2_cached 4 8277499373056
evict_l2_eligible 4 14072033056256
evict_l2_ineligible 4 9966615764992
evict_l2_skip 4 4031567
hash_elements 4 2149096
hash_elements_max 4 6808116
hash_collisions 4 222109741
hash_chains 4 394313
hash_chain_max 4 11
p 4 15524174769
c 4 16823934976
c_min 4 1051495936
c_max 4 16823934976
size 4 16951797592
compressed_size 4 1377493504
uncompressed_size 4 4542341632
overhead_size 4 4540340224
hdr_size 4 101469040
data_size 4 742869504
metadata_size 4 5174964224
dbuf_size 4 2256648576
dnode_size 4 6258464128
bonus_size 4 2234942720
anon_size 4 705024
anon_evictable_data 4 0
anon_evictable_metadata 4 0
mru_size 4 1051861504
mru_evictable_data 4 0
mru_evictable_metadata 4 8192
mru_ghost_size 4 0
mru_ghost_evictable_data 4 0
mru_ghost_evictable_metadata 4 0
mfu_size 4 4865267200
mfu_evictable_data 4 0
mfu_evictable_metadata 4 0
mfu_ghost_size 4 0
mfu_ghost_evictable_data 4 0
mfu_ghost_evictable_metadata 4 0
l2_hits 4 3173912
l2_misses 4 211421480
l2_feeds 4 786569
l2_rw_clash 4 16
l2_read_bytes 4 11567827968
l2_write_bytes 4 4970878013440
l2_writes_sent 4 689715
l2_writes_done 4 689715
l2_writes_error 4 0
l2_writes_lock_retry 4 1425
l2_evict_lock_retry 4 1161
l2_evict_reading 4 37
l2_evict_l1cached 4 1921025
l2_free_on_write 4 385458
l2_abort_lowmem 4 119
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 182019350528
l2_asize 4 140228637696
l2_hdr_size 4 182439072
memory_throttle_count 4 0
memory_direct_count 4 143
memory_indirect_count 4 1174580
memory_all_bytes 4 33785233408
memory_free_bytes 4 7054409728
memory_available_bytes 3 6528663552
arc_no_grow 4 0
arc_tempreserve 4 9216
arc_loaned_bytes 4 0
arc_prune 4 13876726
arc_meta_used 4 16208928088
arc_meta_limit 4 12617951232
arc_dnode_limit 4 1261795123
arc_meta_max 4 16893738168
arc_meta_min 4 16777216
async_upgrade_sync 4 1740388
demand_hit_predictive_prefetch 4 67124685
demand_hit_prescient_prefetch 4 83590118
arc_need_free 4 0
arc_sys_free 4 525747968
arc_raw_size 4 0
System load is down to 95-ish% now, BTW (but nfsd is still hosed).
According to the arcstats the ARC is entirely filled with metadata and is over the target (arc_meta_used > arc_meta_limit). The arc_prune threads are trying to evict some of this metadata to get down to the 75% target value.
arc_meta_used 4 16208928088
arc_meta_limit 4 12617951232
They appear to be unable to make any progress on this, which is why they're spinning. If I recall correctly, this was caused by the file handles cached by nfsd holding a reference which prevents the ARC from freeing the buffers. I believe the nfsd behavior here is a little different, and better behaved, in newer kernels.
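A quick way to confirm this state on an affected system (just a sketch) is to pull those counters straight out of arcstats and watch them:
grep -E '^(arc_meta_used|arc_meta_limit|arc_prune) ' /proc/spl/kstat/zfs/arcstats
If arc_meta_used stays above arc_meta_limit while the arc_prune counter keeps climbing, the prune threads are spinning without making progress.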
You should be able to stop the spinning by setting the limit to 100%. You can make this the default with the zfs_arc_meta_limit_percent=100 module option, but you'll need to reload the ZFS modules for that to take effect. To change the limit immediately, you can set zfs_arc_meta_limit to the arc_meta_max value.
echo 16893738168 >/sys/module/zfs/parameters/zfs_arc_meta_limit
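To make that setting survive reboots, the usual approach (a minimal sketch, assuming the stock module-option mechanism) is a line in /etc/modprobe.d/zfs.conf:
options zfs zfs_arc_meta_limit_percent=100
which is picked up the next time the zfs module is loaded.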
Done, arc_meta_used is going down now, but very slowly, with occasional dips in system load down to 80%-ish.
@behlendorf if you need anything else from this system, say so, because my users are crying and I have to bounce it.
Could it be possible to make it not use all available cores? -- it's a 16-core system that has 16 instances of arc_prune. I wonder if it should be limited to e.g. half the cores, like the ARC is limited to half the RAM by default.
Feel free to bounce the system, the arcstats are enough to confirm the issue. This definitely can, and should, be improved; we have some ideas about how.
Is this from the RAM ARC or the cache SSD (L2ARC)? I.e. if it's the latter, can I get around it by disconnecting the cache device -- temporarily, while moving a zillion files around?
It's the RAM ARC, so disconnecting the cache device won't help. Setting zfs_arc_meta_limit_percent=100 should avoid the spinning.
OK, thank you. I did set it in /etc/modprobe.d/zfs.conf, though I really don't expect to hit this again. Unless another group moves in with 8 years' worth of data that they think they want on our "big" file server.
Thanks again
To change the limit immediately, you can set zfs_arc_meta_limit to the arc_meta_max
@behlendorf thanks for providing the hint, it made the spinning go away for me as well! Below are the parameters/statistics from my system when the spinning occurred (it happened during send/receive to/from local pools).
How do you plan to solve this by default in the future, without needing user intervention?
root@zfs:~# systool -vm spl
Module = "spl"
Attributes:
coresize = "106496"
initsize = "0"
initstate = "live"
refcnt = "5"
srcversion = "09B6CA3EE9E5F5E76F82D5C"
taint = "OE"
uevent = <store method only>
version = "0.8.3-1~rlaager1.18.04.1"
Parameters:
spl_hostid = "0"
spl_hostid_path = "/etc/hostid"
spl_kmem_alloc_max = "1048576"
spl_kmem_alloc_warn = "65536"
spl_kmem_cache_expire= "2"
spl_kmem_cache_kmem_limit= "2048"
spl_kmem_cache_kmem_threads= "4"
spl_kmem_cache_magazine_size= "0"
spl_kmem_cache_max_size= "32"
spl_kmem_cache_obj_per_slab_min= "1"
spl_kmem_cache_obj_per_slab= "8"
spl_kmem_cache_reclaim= "0"
spl_kmem_cache_slab_limit= "16384"
spl_max_show_tasks = "512"
spl_panic_halt = "0"
spl_schedule_hrtimeout_slack_us= "0"
spl_taskq_kick = "0"
spl_taskq_thread_bind= "0"
spl_taskq_thread_dynamic= "1"
spl_taskq_thread_priority= "1"
spl_taskq_thread_sequential= "4"
Sections:
.altinstr_replacement= "0xffffffffc049b795"
.altinstructions = "0xffffffffc049f082"
.bss = "0xffffffffc04a0ec0"
.data = "0xffffffffc04a0000"
.exit.text = "0xffffffffc049b762"
.gnu.linkonce.this_module= "0xffffffffc04a0b80"
.init.text = "0xffffffffc04ac000"
.note.gnu.build-id = "0xffffffffc049c000"
.parainstructions = "0xffffffffc049d358"
.rodata = "0xffffffffc049c820"
.rodata.str1.1 = "0xffffffffc049cdc6"
.rodata.str1.8 = "0xffffffffc049e918"
.smp_locks = "0xffffffffc049d2c4"
.strtab = "0xffffffffc04b2160"
.symtab = "0xffffffffc04ad000"
.text = "0xffffffffc0491000"
__bug_table = "0xffffffffc04a0b60"
__ksymtab = "0xffffffffc049c030"
__ksymtab_strings = "0xffffffffc049d7d4"
__mcount_loc = "0xffffffffc049e238"
__param = "0xffffffffc049def0"
root@zfs:~# systool -vm zfs
Module = "zfs"
Attributes:
coresize = "3854336"
initsize = "0"
initstate = "live"
refcnt = "38"
srcversion = "C6177AA5049CC30B672B1CA"
taint = "POE"
uevent = <store method only>
version = "0.8.3-1~rlaager1.18.04.1"
Parameters:
dbuf_cache_hiwater_pct= "10"
dbuf_cache_lowater_pct= "10"
dbuf_cache_max_bytes= "33554432"
dbuf_cache_shift = "5"
dbuf_metadata_cache_max_bytes= "16777216"
dbuf_metadata_cache_shift= "6"
dmu_object_alloc_chunk_shift= "7"
dmu_prefetch_max = "134217728"
ignore_hole_birth = "1"
l2arc_feed_again = "1"
l2arc_feed_min_ms = "200"
l2arc_feed_secs = "1"
l2arc_headroom = "2"
l2arc_headroom_boost= "200"
l2arc_noprefetch = "1"
l2arc_norw = "0"
l2arc_write_boost = "8388608"
l2arc_write_max = "8388608"
metaslab_aliquot = "524288"
metaslab_bias_enabled= "1"
metaslab_debug_load = "0"
metaslab_debug_unload= "0"
metaslab_df_max_search= "16777216"
metaslab_df_use_largest_segment= "0"
metaslab_force_ganging= "16777217"
metaslab_fragmentation_factor_enabled= "1"
metaslab_lba_weighting_enabled= "1"
metaslab_preload_enabled= "1"
send_holes_without_birth_time= "1"
spa_asize_inflation = "24"
spa_config_path = "/etc/zfs/zpool.cache"
spa_load_print_vdev_tree= "0"
spa_load_verify_data= "1"
spa_load_verify_metadata= "1"
spa_load_verify_shift= "4"
spa_slop_shift = "5"
vdev_removal_max_span= "32768"
vdev_validate_skip = "0"
zap_iterate_prefetch= "1"
zfetch_array_rd_sz = "1048576"
zfetch_max_distance = "8388608"
zfetch_max_streams = "8"
zfetch_min_sec_reap = "2"
zfs_abd_scatter_enabled= "1"
zfs_abd_scatter_max_order= "10"
zfs_abd_scatter_min_size= "1536"
zfs_admin_snapshot = "0"
zfs_arc_average_blocksize= "8192"
zfs_arc_dnode_limit = "0"
zfs_arc_dnode_limit_percent= "10"
zfs_arc_dnode_reduce_percent= "10"
zfs_arc_grow_retry = "0"
zfs_arc_lotsfree_percent= "10"
zfs_arc_max = "1073741824"
zfs_arc_meta_adjust_restarts= "4096"
zfs_arc_meta_limit_percent= "75"
zfs_arc_meta_limit = "0"
zfs_arc_meta_min = "0"
zfs_arc_meta_prune = "10000"
zfs_arc_meta_strategy= "1"
zfs_arc_min_prefetch_ms= "0"
zfs_arc_min_prescient_prefetch_ms= "0"
zfs_arc_min = "0"
zfs_arc_p_dampener_disable= "1"
zfs_arc_p_min_shift = "0"
zfs_arc_pc_percent = "0"
zfs_arc_shrink_shift= "0"
zfs_arc_sys_free = "0"
zfs_async_block_max_blocks= "100000"
zfs_autoimport_disable= "1"
zfs_checksum_events_per_second= "20"
zfs_commit_timeout_pct= "5"
zfs_compressed_arc_enabled= "1"
zfs_condense_indirect_commit_entry_delay_ms= "0"
zfs_condense_indirect_vdevs_enable= "1"
zfs_condense_max_obsolete_bytes= "1073741824"
zfs_condense_min_mapping_bytes= "131072"
zfs_dbgmsg_enable = "1"
zfs_dbgmsg_maxsize = "4194304"
zfs_dbuf_state_index= "0"
zfs_ddt_data_is_special= "1"
zfs_deadman_checktime_ms= "60000"
zfs_deadman_enabled = "1"
zfs_deadman_failmode= "wait"
zfs_deadman_synctime_ms= "600000"
zfs_deadman_ziotime_ms= "300000"
zfs_dedup_prefetch = "0"
zfs_delay_min_dirty_percent= "60"
zfs_delay_scale = "500000"
zfs_delete_blocks = "20480"
zfs_dirty_data_max_max= "2020201472"
zfs_dirty_data_max_max_percent= "25"
zfs_dirty_data_max_percent= "10"
zfs_dirty_data_max = "808080588"
zfs_dirty_data_sync_percent= "20"
zfs_disable_ivset_guid_check= "0"
zfs_dmu_offset_next_sync= "0"
zfs_expire_snapshot = "300"
zfs_flags = "0"
zfs_free_bpobj_enabled= "1"
zfs_free_leak_on_eio= "0"
zfs_free_min_time_ms= "1000"
zfs_immediate_write_sz= "32768"
zfs_initialize_value= "16045690984833335022"
zfs_key_max_salt_uses= "400000000"
zfs_lua_max_instrlimit= "100000000"
zfs_lua_max_memlimit= "104857600"
zfs_max_missing_tvds= "0"
zfs_max_recordsize = "1048576"
zfs_metaslab_fragmentation_threshold= "70"
zfs_metaslab_segment_weight_enabled= "1"
zfs_metaslab_switch_threshold= "2"
zfs_mg_fragmentation_threshold= "95"
zfs_mg_noalloc_threshold= "0"
zfs_multihost_fail_intervals= "10"
zfs_multihost_history= "0"
zfs_multihost_import_intervals= "20"
zfs_multihost_interval= "1000"
zfs_multilist_num_sublists= "0"
zfs_no_scrub_io = "0"
zfs_no_scrub_prefetch= "0"
zfs_nocacheflush = "0"
zfs_nopwrite_enabled= "1"
zfs_object_mutex_size= "64"
zfs_obsolete_min_time_ms= "500"
zfs_override_estimate_recordsize= "0"
zfs_pd_bytes_max = "52428800"
zfs_per_txg_dirty_frees_percent= "5"
zfs_prefetch_disable= "0"
zfs_read_chunk_size = "1048576"
zfs_read_history = "0"
zfs_read_history_hits= "0"
zfs_reconstruct_indirect_combinations_max= "4096"
zfs_recover = "0"
zfs_recv_queue_length= "16777216"
zfs_removal_ignore_errors= "0"
zfs_removal_suspend_progress= "0"
zfs_remove_max_segment= "16777216"
zfs_resilver_disable_defer= "0"
zfs_resilver_min_time_ms= "3000"
zfs_scan_checkpoint_intval= "7200"
zfs_scan_fill_weight= "3"
zfs_scan_ignore_errors= "0"
zfs_scan_issue_strategy= "0"
zfs_scan_legacy = "0"
zfs_scan_max_ext_gap= "2097152"
zfs_scan_mem_lim_fact= "20"
zfs_scan_mem_lim_soft_fact= "20"
zfs_scan_strict_mem_lim= "0"
zfs_scan_suspend_progress= "0"
zfs_scan_vdev_limit = "4194304"
zfs_scrub_min_time_ms= "1000"
zfs_send_corrupt_data= "0"
zfs_send_queue_length= "16777216"
zfs_send_unmodified_spill_blocks= "1"
zfs_slow_io_events_per_second= "20"
zfs_spa_discard_memory_limit= "16777216"
zfs_special_class_metadata_reserve_pct= "25"
zfs_sync_pass_deferred_free= "2"
zfs_sync_pass_dont_compress= "8"
zfs_sync_pass_rewrite= "2"
zfs_sync_taskq_batch_pct= "75"
zfs_trim_extent_bytes_max= "134217728"
zfs_trim_extent_bytes_min= "32768"
zfs_trim_metaslab_skip= "0"
zfs_trim_queue_limit= "10"
zfs_trim_txg_batch = "32"
zfs_txg_history = "100"
zfs_txg_timeout = "5"
zfs_unlink_suspend_progress= "0"
zfs_user_indirect_is_special= "1"
zfs_vdev_aggregate_trim= "0"
zfs_vdev_aggregation_limit_non_rotating= "131072"
zfs_vdev_aggregation_limit= "1048576"
zfs_vdev_async_read_max_active= "3"
zfs_vdev_async_read_min_active= "1"
zfs_vdev_async_write_active_max_dirty_percent= "60"
zfs_vdev_async_write_active_min_dirty_percent= "30"
zfs_vdev_async_write_max_active= "10"
zfs_vdev_async_write_min_active= "2"
zfs_vdev_cache_bshift= "16"
zfs_vdev_cache_max = "16384"
zfs_vdev_cache_size = "0"
zfs_vdev_default_ms_count= "200"
zfs_vdev_initializing_max_active= "1"
zfs_vdev_initializing_min_active= "1"
zfs_vdev_max_active = "1000"
zfs_vdev_min_ms_count= "16"
zfs_vdev_mirror_non_rotating_inc= "0"
zfs_vdev_mirror_non_rotating_seek_inc= "1"
zfs_vdev_mirror_rotating_inc= "0"
zfs_vdev_mirror_rotating_seek_inc= "5"
zfs_vdev_mirror_rotating_seek_offset= "1048576"
zfs_vdev_ms_count_limit= "131072"
zfs_vdev_queue_depth_pct= "1000"
zfs_vdev_raidz_impl = "cycle [fastest] original scalar sse2 ssse3 "
zfs_vdev_read_gap_limit= "32768"
zfs_vdev_removal_max_active= "2"
zfs_vdev_removal_min_active= "1"
zfs_vdev_scheduler = "unused"
zfs_vdev_scrub_max_active= "2"
zfs_vdev_scrub_min_active= "1"
zfs_vdev_sync_read_max_active= "10"
zfs_vdev_sync_read_min_active= "10"
zfs_vdev_sync_write_max_active= "10"
zfs_vdev_sync_write_min_active= "10"
zfs_vdev_trim_max_active= "2"
zfs_vdev_trim_min_active= "1"
zfs_vdev_write_gap_limit= "4096"
zfs_zevent_cols = "80"
zfs_zevent_console = "0"
zfs_zevent_len_max = "512"
zfs_zil_clean_taskq_maxalloc= "1048576"
zfs_zil_clean_taskq_minalloc= "1024"
zfs_zil_clean_taskq_nthr_pct= "100"
zil_maxblocksize = "131072"
zil_nocacheflush = "0"
zil_replay_disable = "0"
zil_slog_bulk = "786432"
zio_deadman_log_all = "0"
zio_dva_throttle_enabled= "1"
zio_requeue_io_start_cut_in_line= "1"
zio_slow_io_ms = "30000"
zio_taskq_batch_pct = "75"
zvol_inhibit_dev = "0"
zvol_major = "230"
zvol_max_discard_blocks= "16384"
zvol_prefetch_bytes = "131072"
zvol_request_sync = "0"
zvol_threads = "32"
zvol_volmode = "1"
Sections:
.bss = "0xffffffffc07d2100"
.data = "0xffffffffc07c0060"
.exit.text = "0xffffffffc07738c9"
.gnu.linkonce.this_module= "0xffffffffc07d1dc0"
.init.text = "0xffffffffc09a7000"
.note.gnu.build-id = "0xffffffffc0774000"
.parainstructions = "0xffffffffc07aec00"
.rodata = "0xffffffffc0776800"
.rodata.str1.1 = "0xffffffffc078de1c"
.rodata.str1.8 = "0xffffffffc0792cb0"
.smp_locks = "0xffffffffc078d684"
.strtab = "0xffffffffc09e1178"
.symtab = "0xffffffffc09a9000"
.text = "0xffffffffc05f9000"
.text.unlikely = "0xffffffffc07731d8"
__bug_table = "0xffffffffc07d1da4"
__jump_table = "0xffffffffc07c0000"
__ksymtab = "0xffffffffc0774030"
__ksymtab_strings = "0xffffffffc07bd148"
__mcount_loc = "0xffffffffc07b4a28"
__param = "0xffffffffc07b2610"
root@zfs:~# cat /proc/spl/kstat/zfs/arcstats
12 1 0x01 98 26656 14093184634 1224910481851349
name type data
hits 4 704041195
misses 4 2744949
demand_data_hits 4 260501
demand_data_misses 4 476953
demand_metadata_hits 4 703209429
demand_metadata_misses 4 969970
prefetch_data_hits 4 11386
prefetch_data_misses 4 239621
prefetch_metadata_hits 4 559879
prefetch_metadata_misses 4 1058405
mru_hits 4 13617271
mru_ghost_hits 4 17737
mfu_hits 4 690001738
mfu_ghost_hits 4 1765
deleted 4 3101597
mutex_miss 4 1677
access_skip 4 5
evict_skip 4 39643211
evict_not_enough 4 7411892
evict_l2_cached 4 0
evict_l2_eligible 4 21536072704
evict_l2_ineligible 4 40198401024
evict_l2_skip 4 0
hash_elements 4 13708
hash_elements_max 4 432738
hash_collisions 4 364283
hash_chains 4 94
hash_chain_max 4 5
p 4 689080832
c 4 1073741824
c_min 4 252525184
c_max 4 1073741824
size 4 1222968480
compressed_size 4 295622656
uncompressed_size 4 332941312
overhead_size 4 273108480
hdr_size 4 6912992
data_size 4 33457664
metadata_size 4 535273472
dbuf_size 4 133838016
dnode_size 4 380898624
bonus_size 4 132587712
anon_size 4 50364416
anon_evictable_data 4 0
anon_evictable_metadata 4 0
mru_size 4 310185984
mru_evictable_data 4 0
mru_evictable_metadata 4 0
mru_ghost_size 4 0
mru_ghost_evictable_data 4 0
mru_ghost_evictable_metadata 4 0
mfu_size 4 208180736
mfu_evictable_data 4 0
mfu_evictable_metadata 4 0
mfu_ghost_size 4 0
mfu_ghost_evictable_data 4 0
mfu_ghost_evictable_metadata 4 0
l2_hits 4 0
l2_misses 4 0
l2_feeds 4 0
l2_rw_clash 4 0
l2_read_bytes 4 0
l2_write_bytes 4 0
l2_writes_sent 4 0
l2_writes_done 4 0
l2_writes_error 4 0
l2_writes_lock_retry 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_evict_l1cached 4 0
l2_free_on_write 4 0
l2_abort_lowmem 4 0
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 0
l2_asize 4 0
l2_hdr_size 4 0
memory_throttle_count 4 0
memory_direct_count 4 0
memory_indirect_count 4 0
memory_all_bytes 4 8080805888
memory_free_bytes 4 3879395328
memory_available_bytes 3 3753136128
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 15495680
arc_prune 4 11107355
arc_meta_used 4 1189510816
arc_meta_limit 4 805306368
arc_dnode_limit 4 80530636
arc_meta_max 4 1189630648
arc_meta_min 4 16777216
async_upgrade_sync 4 1796
demand_hit_predictive_prefetch 4 77527
demand_hit_prescient_prefetch 4 1448691
arc_need_free 4 0
arc_sys_free 4 126262592
arc_raw_size 4 195579904
root@zfs:~# arc_summary
------------------------------------------------------------------------
ZFS Subsystem Report Wed Feb 12 16:33:48 2020
Linux 4.15.0-76-generic 0.8.3-1~rlaager1.18.04.1
Machine: zfs (x86_64) 0.8.3-1~rlaager1.18.04.1
ARC status: HEALTHY
Memory throttle count: 0
ARC size (current): 114.5 % 1.1 GiB
Target size (adaptive): 100.0 % 1.0 GiB
Min size (hard limit): 23.5 % 240.8 MiB
Max size (high water): 4:1 1.0 GiB
Most Frequently Used (MFU) cache size: 39.7 % 199.4 MiB
Most Recently Used (MRU) cache size: 60.3 % 302.6 MiB
Metadata cache size (hard limit): 75.0 % 768.0 MiB
Metadata cache size (current): 148.5 % 1.1 GiB
Dnode cache size (hard limit): 10.0 % 76.8 MiB
Dnode cache size (current): 474.1 % 364.1 MiB
ARC hash breakdown:
Elements max: 432.7k
Elements current: 3.2 % 13.9k
Collisions: 364.3k
Chain max: 5
Chains: 98
ARC misc:
Deleted: 3.1M
Mutex misses: 1.7k
Eviction skips: 40.1M
ARC total accesses (hits + misses): 706.8M
Cache hit ratio: 99.6 % 704.0M
Cache miss ratio: 0.4 % 2.7M
Actual hit ratio (MFU + MRU hits): 99.6 % 703.6M
Data demand efficiency: 35.3 % 737.5k
Data prefetch efficiency: 4.5 % 251.0k
Cache hits by cache type:
Most frequently used (MFU): 98.0 % 690.0M
Most recently used (MRU): 1.9 % 13.6M
Most frequently used (MFU) ghost: < 0.1 % 1.8k
Most recently used (MRU) ghost: < 0.1 % 17.7k
Anonymously used: 0.1 % 402.7k
Cache hits by data type:
Demand data: < 0.1 % 260.5k
Demand prefetch data: < 0.1 % 11.4k
Demand metadata: 99.9 % 703.2M
Demand prefetch metadata: 0.1 % 559.9k
Cache misses by data type:
Demand data: 17.4 % 477.0k
Demand prefetch data: 8.7 % 239.6k
Demand metadata: 35.3 % 970.0k
Demand prefetch metadata: 38.6 % 1.1M
DMU prefetch efficiency: 439.6M
Hit ratio: 0.1 % 235.6k
Miss ratio: 99.9 % 439.3M
L2ARC not detected, skipping section
Solaris Porting Layer (SPL):
spl_hostid 0
spl_hostid_path /etc/hostid
spl_kmem_alloc_max 1048576
spl_kmem_alloc_warn 65536
spl_kmem_cache_expire 2
spl_kmem_cache_kmem_limit 2048
spl_kmem_cache_kmem_threads 4
spl_kmem_cache_magazine_size 0
spl_kmem_cache_max_size 32
spl_kmem_cache_obj_per_slab 8
spl_kmem_cache_obj_per_slab_min 1
spl_kmem_cache_reclaim 0
spl_kmem_cache_slab_limit 16384
spl_max_show_tasks 512
spl_panic_halt 0
spl_schedule_hrtimeout_slack_us 0
spl_taskq_kick 0
spl_taskq_thread_bind 0
spl_taskq_thread_dynamic 1
spl_taskq_thread_priority 1
spl_taskq_thread_sequential 4
Tunables:
dbuf_cache_hiwater_pct 10
dbuf_cache_lowater_pct 10
dbuf_cache_max_bytes 33554432
dbuf_cache_shift 5
dbuf_metadata_cache_max_bytes 16777216
dbuf_metadata_cache_shift 6
dmu_object_alloc_chunk_shift 7
dmu_prefetch_max 134217728
ignore_hole_birth 1
l2arc_feed_again 1
l2arc_feed_min_ms 200
l2arc_feed_secs 1
l2arc_headroom 2
l2arc_headroom_boost 200
l2arc_noprefetch 1
l2arc_norw 0
l2arc_write_boost 8388608
l2arc_write_max 8388608
metaslab_aliquot 524288
metaslab_bias_enabled 1
metaslab_debug_load 0
metaslab_debug_unload 0
metaslab_df_max_search 16777216
metaslab_df_use_largest_segment 0
metaslab_force_ganging 16777217
metaslab_fragmentation_factor_enabled 1
metaslab_lba_weighting_enabled 1
metaslab_preload_enabled 1
send_holes_without_birth_time 1
spa_asize_inflation 24
spa_config_path /etc/zfs/zpool.cache
spa_load_print_vdev_tree 0
spa_load_verify_data 1
spa_load_verify_metadata 1
spa_load_verify_shift 4
spa_slop_shift 5
vdev_removal_max_span 32768
vdev_validate_skip 0
zap_iterate_prefetch 1
zfetch_array_rd_sz 1048576
zfetch_max_distance 8388608
zfetch_max_streams 8
zfetch_min_sec_reap 2
zfs_abd_scatter_enabled 1
zfs_abd_scatter_max_order 10
zfs_abd_scatter_min_size 1536
zfs_admin_snapshot 0
zfs_arc_average_blocksize 8192
zfs_arc_dnode_limit 0
zfs_arc_dnode_limit_percent 10
zfs_arc_dnode_reduce_percent 10
zfs_arc_grow_retry 0
zfs_arc_lotsfree_percent 10
zfs_arc_max 1073741824
zfs_arc_meta_adjust_restarts 4096
zfs_arc_meta_limit 0
zfs_arc_meta_limit_percent 75
zfs_arc_meta_min 0
zfs_arc_meta_prune 10000
zfs_arc_meta_strategy 1
zfs_arc_min 0
zfs_arc_min_prefetch_ms 0
zfs_arc_min_prescient_prefetch_ms 0
zfs_arc_p_dampener_disable 1
zfs_arc_p_min_shift 0
zfs_arc_pc_percent 0
zfs_arc_shrink_shift 0
zfs_arc_sys_free 0
zfs_async_block_max_blocks 100000
zfs_autoimport_disable 1
zfs_checksum_events_per_second 20
zfs_commit_timeout_pct 5
zfs_compressed_arc_enabled 1
zfs_condense_indirect_commit_entry_delay_ms 0
zfs_condense_indirect_vdevs_enable 1
zfs_condense_max_obsolete_bytes 1073741824
zfs_condense_min_mapping_bytes 131072
zfs_dbgmsg_enable 1
zfs_dbgmsg_maxsize 4194304
zfs_dbuf_state_index 0
zfs_ddt_data_is_special 1
zfs_deadman_checktime_ms 60000
zfs_deadman_enabled 1
zfs_deadman_failmode wait
zfs_deadman_synctime_ms 600000
zfs_deadman_ziotime_ms 300000
zfs_dedup_prefetch 0
zfs_delay_min_dirty_percent 60
zfs_delay_scale 500000
zfs_delete_blocks 20480
zfs_dirty_data_max 808080588
zfs_dirty_data_max_max 2020201472
zfs_dirty_data_max_max_percent 25
zfs_dirty_data_max_percent 10
zfs_dirty_data_sync_percent 20
zfs_disable_ivset_guid_check 0
zfs_dmu_offset_next_sync 0
zfs_expire_snapshot 300
zfs_flags 0
zfs_free_bpobj_enabled 1
zfs_free_leak_on_eio 0
zfs_free_min_time_ms 1000
zfs_immediate_write_sz 32768
zfs_initialize_value 16045690984833335022
zfs_key_max_salt_uses 400000000
zfs_lua_max_instrlimit 100000000
zfs_lua_max_memlimit 104857600
zfs_max_missing_tvds 0
zfs_max_recordsize 1048576
zfs_metaslab_fragmentation_threshold 70
zfs_metaslab_segment_weight_enabled 1
zfs_metaslab_switch_threshold 2
zfs_mg_fragmentation_threshold 95
zfs_mg_noalloc_threshold 0
zfs_multihost_fail_intervals 10
zfs_multihost_history 0
zfs_multihost_import_intervals 20
zfs_multihost_interval 1000
zfs_multilist_num_sublists 0
zfs_no_scrub_io 0
zfs_no_scrub_prefetch 0
zfs_nocacheflush 0
zfs_nopwrite_enabled 1
zfs_object_mutex_size 64
zfs_obsolete_min_time_ms 500
zfs_override_estimate_recordsize 0
zfs_pd_bytes_max 52428800
zfs_per_txg_dirty_frees_percent 5
zfs_prefetch_disable 0
zfs_read_chunk_size 1048576
zfs_read_history 0
zfs_read_history_hits 0
zfs_reconstruct_indirect_combinations_max 4096
zfs_recover 0
zfs_recv_queue_length 16777216
zfs_removal_ignore_errors 0
zfs_removal_suspend_progress 0
zfs_remove_max_segment 16777216
zfs_resilver_disable_defer 0
zfs_resilver_min_time_ms 3000
zfs_scan_checkpoint_intval 7200
zfs_scan_fill_weight 3
zfs_scan_ignore_errors 0
zfs_scan_issue_strategy 0
zfs_scan_legacy 0
zfs_scan_max_ext_gap 2097152
zfs_scan_mem_lim_fact 20
zfs_scan_mem_lim_soft_fact 20
zfs_scan_strict_mem_lim 0
zfs_scan_suspend_progress 0
zfs_scan_vdev_limit 4194304
zfs_scrub_min_time_ms 1000
zfs_send_corrupt_data 0
zfs_send_queue_length 16777216
zfs_send_unmodified_spill_blocks 1
zfs_slow_io_events_per_second 20
zfs_spa_discard_memory_limit 16777216
zfs_special_class_metadata_reserve_pct 25
zfs_sync_pass_deferred_free 2
zfs_sync_pass_dont_compress 8
zfs_sync_pass_rewrite 2
zfs_sync_taskq_batch_pct 75
zfs_trim_extent_bytes_max 134217728
zfs_trim_extent_bytes_min 32768
zfs_trim_metaslab_skip 0
zfs_trim_queue_limit 10
zfs_trim_txg_batch 32
zfs_txg_history 100
zfs_txg_timeout 5
zfs_unlink_suspend_progress 0
zfs_user_indirect_is_special 1
zfs_vdev_aggregate_trim 0
zfs_vdev_aggregation_limit 1048576
zfs_vdev_aggregation_limit_non_rotating 131072
zfs_vdev_async_read_max_active 3
zfs_vdev_async_read_min_active 1
zfs_vdev_async_write_active_max_dirty_percent 60
zfs_vdev_async_write_active_min_dirty_percent 30
zfs_vdev_async_write_max_active 10
zfs_vdev_async_write_min_active 2
zfs_vdev_cache_bshift 16
zfs_vdev_cache_max 16384
zfs_vdev_cache_size 0
zfs_vdev_default_ms_count 200
zfs_vdev_initializing_max_active 1
zfs_vdev_initializing_min_active 1
zfs_vdev_max_active 1000
zfs_vdev_min_ms_count 16
zfs_vdev_mirror_non_rotating_inc 0
zfs_vdev_mirror_non_rotating_seek_inc 1
zfs_vdev_mirror_rotating_inc 0
zfs_vdev_mirror_rotating_seek_inc 5
zfs_vdev_mirror_rotating_seek_offset 1048576
zfs_vdev_ms_count_limit 131072
zfs_vdev_queue_depth_pct 1000
zfs_vdev_raidz_impl cycle [fastest] original scalar sse2 ssse3
zfs_vdev_read_gap_limit 32768
zfs_vdev_removal_max_active 2
zfs_vdev_removal_min_active 1
zfs_vdev_scheduler unused
zfs_vdev_scrub_max_active 2
zfs_vdev_scrub_min_active 1
zfs_vdev_sync_read_max_active 10
zfs_vdev_sync_read_min_active 10
zfs_vdev_sync_write_max_active 10
zfs_vdev_sync_write_min_active 10
zfs_vdev_trim_max_active 2
zfs_vdev_trim_min_active 1
zfs_vdev_write_gap_limit 4096
zfs_zevent_cols 80
zfs_zevent_console 0
zfs_zevent_len_max 512
zfs_zil_clean_taskq_maxalloc 1048576
zfs_zil_clean_taskq_minalloc 1024
zfs_zil_clean_taskq_nthr_pct 100
zil_maxblocksize 131072
zil_nocacheflush 0
zil_replay_disable 0
zil_slog_bulk 786432
zio_deadman_log_all 0
zio_dva_throttle_enabled 1
zio_requeue_io_start_cut_in_line 1
zio_slow_io_ms 30000
zio_taskq_batch_pct 75
zvol_inhibit_dev 0
zvol_major 230
zvol_max_discard_blocks 16384
zvol_prefetch_bytes 131072
zvol_request_sync 0
zvol_threads 32
zvol_volmode 1
VDEV cache disabled, skipping section
ZIL committed transactions: 16
Commit requests: 62
Flushes to stable storage: 62
Transactions to SLOG storage pool: 0 Bytes 0
Transactions to non-SLOG storage pool: 3.7 KiB 4
grep . /sys/module/zfs/parameters/zfs_arc_meta*
/sys/module/zfs/parameters/zfs_arc_meta_adjust_restarts:4096
/sys/module/zfs/parameters/zfs_arc_meta_limit:0
/sys/module/zfs/parameters/zfs_arc_meta_limit_percent:75
/sys/module/zfs/parameters/zfs_arc_meta_min:0
/sys/module/zfs/parameters/zfs_arc_meta_prune:10000
/sys/module/zfs/parameters/zfs_arc_meta_strategy:1
The workaround had only temporary effects -- arc_prune is spinning again.
Update: I had to increase zfs_arc_max to end the spinning.
Eventually it started spinning again despite the increased zfs_arc_max.
Hasn't happened here -- knock on wood -- but I'm done copying lots of files. Mine's got 64G RAM and a 160G cache SSD, and the ARC size is left at the default. It's a 2x8 Opteron and it didn't lock up until it had arc_prune spinning on every core. (I don't have a single one in top now.)
I saw this arc_prune behavior last year on one server with 32GB of ECC RAM. I had to disable the weekly scrubs because of it. I think there is another thread where I commented on that. Since I have just moved to a new building and the network topology is completely different, I can't test now.
I'm seeing the issue again after a reboot, on a simple zfs umount on the newly created pool.
I thought I'd add a note that I have also seen this on Debian Buster with the Debian backports version on amd64. It did clear itself out after some time (hours, if I remember right). If I see it again I'll try to get more information.
A couple of data points worth noting: there is no cache device, the machine only receives snapshots from a different system, there is little if any user I/O, and no network shares are exported from it.
I'm seeing a similar problem (arc_reclaim and arc_prune chewing up CPU). Bumping zfs_arc_meta_limit_percent only provided a temporary reprieve while arc_meta filled up; bumping it all the way to 100% doesn't fix the issue.
Why is there so much metadata in the ARC in the first place? If relevant, the machine in question is a dedicated backup machine, currently doing nothing other than being the receiving end of a lot of zfs send -i "backups".
System info:
russ@pth-hs1:~$ cat /sys/module/zfs/version
0.7.5-1ubuntu16.6
russ@pth-hs1:~$ uname -a
Linux pth-hs1 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
russ@pth-hs1:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.4 LTS
Release: 18.04
Codename: bionic
Here is the current state of the ARC (allowing 100% arc_meta, but still having prune issues):
russ@pth-hs1:~$ grep dnode /proc/spl/kstat/zfs/arcstats
dnode_size 4 8716864
arc_dnode_limit 4 1684769382
russ@pth-hs1:~$ egrep "(arc_meta_used|arc_meta_limit)" /proc/spl/kstat/zfs/arcstats
arc_meta_used 4 16847797336
arc_meta_limit 4 16847693824
russ@pth-hs1:~$ grep -H . /sys/module/zfs/parameters/zfs_arc_{meta_limit{,_percent},dnode_limit{,_percent}}
/sys/module/zfs/parameters/zfs_arc_meta_limit:0
/sys/module/zfs/parameters/zfs_arc_meta_limit_percent:100
/sys/module/zfs/parameters/zfs_arc_dnode_limit:0
/sys/module/zfs/parameters/zfs_arc_dnode_limit_percent:10
russ@pth-hs1:~$ cat /proc/spl/kstat/zfs/arcstats
13 1 0x01 96 4608 41705217416 3820892427321077
name type data
hits 4 7413219043
misses 4 5467855407
demand_data_hits 4 0
demand_data_misses 4 0
demand_metadata_hits 4 7362177240
demand_metadata_misses 4 5453086223
prefetch_data_hits 4 0
prefetch_data_misses 4 0
prefetch_metadata_hits 4 51041803
prefetch_metadata_misses 4 14769184
mru_hits 4 165815182
mru_ghost_hits 4 1446761
mfu_hits 4 7216088077
mfu_ghost_hits 4 4321466
deleted 4 112910070
mutex_miss 4 40132
access_skip 4 3387
evict_skip 4 15984980671
evict_not_enough 4 22139723
evict_l2_cached 4 0
evict_l2_eligible 4 948597889024
evict_l2_ineligible 4 1871813890048
evict_l2_skip 4 0
hash_elements 4 428907
hash_elements_max 4 2104109
hash_collisions 4 43345413
hash_chains 4 20493
hash_chain_max 4 7
p 4 16200329216
c 4 16847693824
c_min 4 1052980864
c_max 4 16847693824
size 4 16847639536
compressed_size 4 16577940480
uncompressed_size 4 53259091456
overhead_size 4 114052608
hdr_size 4 141165920
data_size 4 0
metadata_size 4 16691993088
dbuf_size 4 3742544
dnode_size 4 8716864
bonus_size 4 2021120
anon_size 4 296960
anon_evictable_data 4 0
anon_evictable_metadata 4 0
mru_size 4 31092736
mru_evictable_data 4 0
mru_evictable_metadata 4 27729920
mru_ghost_size 4 66322432
mru_ghost_evictable_data 4 0
mru_ghost_evictable_metadata 4 66322432
mfu_size 4 16660603392
mfu_evictable_data 4 0
mfu_evictable_metadata 4 16499196928
mfu_ghost_size 4 21757952
mfu_ghost_evictable_data 4 0
mfu_ghost_evictable_metadata 4 21757952
l2_hits 4 0
l2_misses 4 0
l2_feeds 4 0
l2_rw_clash 4 0
l2_read_bytes 4 0
l2_write_bytes 4 0
l2_writes_sent 4 0
l2_writes_done 4 0
l2_writes_error 4 0
l2_writes_lock_retry 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_evict_l1cached 4 0
l2_free_on_write 4 0
l2_abort_lowmem 4 0
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 0
l2_asize 4 0
l2_hdr_size 4 0
memory_throttle_count 4 0
memory_direct_count 4 0
memory_indirect_count 4 0
memory_all_bytes 4 33695387648
memory_free_bytes 4 13351628800
memory_available_bytes 3 12825141248
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 67235209
arc_meta_used 4 16847639536
arc_meta_limit 4 16847693824
arc_dnode_limit 4 1684769382
arc_meta_max 4 16913555592
arc_meta_min 4 16777216
sync_wait_for_async 4 57
demand_hit_predictive_prefetch 4 2989
arc_need_free 4 0
arc_sys_free 4 526490432
arc_prune settled down after a few hours and CPU usage is now back to normal levels. Given the timing, I'm unclear whether bumping zfs_arc_meta_limit_percent to 100% was the difference maker, or whether I just needed to wait a few more hours.
Below is a crude view of the incident from a CPU load perspective across all CPUs, where the middle marker is when I was bumping zfs_arc_meta_limit_percent up, ultimately leaving it at 100%:
I don't know enough about ARC performance to know why metadata would be such a significant fraction, nor whether it is a long-term problem to move zfs_arc_meta_limit_percent from the default value of 75% to 100%.
Maybe related/duplicate: https://github.com/openzfs/zfs/issues/10119
For what it's worth, whenever I run a workload on my system that I know will peg the arc_prune threads, I run this in another terminal:
while true ; do zpool iostat 10 12 ; echo 2 > /proc/sys/vm/drop_caches ; done
It's blunt and brutal, but it stops those arc_prune threads from spending 50% of the CPUs trying to acquire spinlocks. It might save others from reboots, anyway.
Thanks
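For context on what that loop is doing: per the kernel's vm sysctl documentation, the value written to /proc/sys/vm/drop_caches selects what gets dropped (a paraphrase, not ZFS-specific):
echo 1 > /proc/sys/vm/drop_caches   # page cache only
echo 2 > /proc/sys/vm/drop_caches   # reclaimable slab objects, i.e. dentries and inodes
echo 3 > /proc/sys/vm/drop_caches   # both
Dropping the dentry/inode slab is what releases the references that can otherwise keep ARC metadata buffers pinned, which is why value 2 (or 3, as used further down) gives the prune threads something to evict.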
Not for me; I already tried that, and eventually the system still hangs.
Might be related to https://github.com/openzfs/zfs/issues/7559
We recently experienced the very same issue: arc_prune and arc_reclaim consume a lot of CPU across all cores.
We see this issue on only a single machine. The biggest difference is that this is an AMD Epyc with a lot of cores: an AMD EPYC 7452 32-Core Processor.
It's also worth mentioning that this is an SSD-only pool with no log or cache device:
pool: tank
state: ONLINE
scan: scrub repaired 0B in 214h25m with 0 errors on Mon Aug 10 20:25:45 2020
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-INTEL_SSDSC2KB038T8_PHYF947405A33P8EGN ONLINE 0 0 0
ata-INTEL_SSDSC2KB038T8_PHYF947403PT3P8EGN ONLINE 0 0 0
errors: No known data errors
And what is also very interesting: according to the monitoring system it started to happen only a couple of weeks ago; the machine was working properly for weeks before.
I also tried to increase arc_meta_limit, with no success:
# cat /proc/spl/kstat/zfs/arcstats|grep arc_meta
arc_meta_used 4 21485281368
arc_meta_limit 4 16106127360
arc_meta_max 4 21558749936
arc_meta_min 4 16777216
# echo 21558749936 > /sys/module/zfs/parameters/zfs_arc_meta_limit
# cat /proc/spl/kstat/zfs/arcstats|grep arc_meta
arc_meta_used 4 21530789416
arc_meta_limit 4 16106127360
arc_meta_max 4 21558749936
arc_meta_min 4 16777216
BTW, echo 3 > /proc/sys/vm/drop_caches also heals the machine for a short time.
Why is that?
In addition, it's unclear to me: will zfs_arc_meta_limit_percent=100 fix it, or is it also just a temporary solution?
@behlendorf sorry to ping you, but I read all the referenced issues without fully understanding some important things:
- Why does arc_meta grow so much?
- Why can arc_prune not actually reclaim sufficient memory (i.e., why does it make no forward progress)?
- What are pinned buffers, and why can they not be reclaimed by arc_reclaim?
- What kind of workload can trigger the described issue?
I have yet to see a 100% arc_prune and/or arc_reclaim thread on my systems, but I would like to be prepared for the worst...
Thanks.
On Sun, Aug 30, 2020 at 05:01:23AM -0700, shodanshok wrote:
- what kind of workload can trigger the described issue?
In my case, I've got roughly one terabyte of unpacked source code on a system with only 128 gigabytes of memory and I want to use ripgrep to search through all the files.
I realize this is a silly workload.
Because it runs faster with a loop like this running at the same time, I have some hope that the situation can improve with the clever application of programming:
while true ; do sleep 300 ; echo 2 > /proc/sys/vm/drop_caches ; done
Thanks
Possibly related:
#10331 - fix dnode eviction typo in arc_evict_state()
#10508 - Data throughput causing apparent (directory) metadata eviction with metadata_size << arc_meta_min
#10563 - dbuf cache size is 1/32nd what was intended
#10600 - Revise ARC shrinker algorithm
#10610 - Limit dbuf cache sizes based only on ARC target size by default
#10618 - Restore ARC MFU/MRU pressure
I have a PowerEdge R730 that has been experiencing this problem since migrating to Ubuntu 20.04. The problem didn't occur with the very same pool under CentOS 8. It's a Veeam backup repository, and the problem occurs daily, causing connections to time out and the respective backups to fail. My current solution is daily reboots.
This data was gathered after today's reboot. I'll gather again tomorrow while the problem exists.
$ cat /sys/module/zfs/version
0.8.3-1ubuntu12.4
$ grep -H . /sys/module/zfs/parameters/zfs_arc_{meta_limit{,_percent},dnode_limit{,_percent}}
/sys/module/zfs/parameters/zfs_arc_meta_limit:0
/sys/module/zfs/parameters/zfs_arc_meta_limit_percent:75
/sys/module/zfs/parameters/zfs_arc_dnode_limit:0
/sys/module/zfs/parameters/zfs_arc_dnode_limit_percent:10
$ grep dnode /proc/spl/kstat/zfs/arcstats
dnode_size 4 14965008
arc_dnode_limit 4 2529106329