ARC metadata exhaustion handling in Linux and FreeBSD with a large number of datasets
System information
| Type | Version/Name |
|---|---|
| Distribution Name | Ubuntu / FreeBSD |
| Distribution Version | 22.04.1 LTS / 13.1-RELEASE |
| Kernel Version | 5.15.0-52-generic / 13.1-n250148-fc952ac2212 |
| Architecture | amd64 / amd64 |
| OpenZFS Version | zfs-2.1.6-0york1~22.04 / zfs-2.1.4-FreeBSD_g52bad4f23 |
Describe the problem you're observing
When creating and mounting a large number of datasets (thousands to tens of thousands), there are some differences in ARC handling between Linux and FreeBSD.
Linux observed behaviour: As datasets are created and mounted, some information about them is stored in the ARC's metadata cache and dnode cache. After enough datasets are created, the high water mark for metadata is eventually reached. This causes ZFS to start pruning the ARC to remove unneeded records, but it appears that the metadata records for mounted datasets must always stay in the ARC. This causes contention, and arc_prune will consume all CPU resources trying to remove records that are constantly being re-added.
FreeBSD observed behaviour: As datasets are created and mounted, some information about them is stored in the ARC's metadata cache and dnode cache. When enough datasets are created to cross the metadata high water mark threshold, ARC pruning starts. Instead of removing only one record or a small number of records at a time, the metadata cache is flushed, freeing up space without actively refilling the cache with the already mounted datasets. This allows ZFS to continue on without arc_prune consuming 100% of the CPU.
The question here is: can Linux be made to act like FreeBSD?
Can caching of a dataset be deferred until it is accessed?
Or can larger portions of the ARC metadata be freed at a time?
This would seem appropriate when doing bulk creations, and it would also help on import to avoid caching everything up front, since it might not be needed right away.
As a note, the amount of data on the zpool doesn't matter; this occurs on a blank pool just by creating datasets.
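For anyone experimenting with the "free larger portions at a time" idea, the Linux module in 2.1.x exposes tunables that control how much each arc_prune pass asks the kernel to drop. A minimal sketch, assuming the 2.1.x parameter names and using example values rather than tested recommendations:

```bash
# Sketch only: scan/drop more dentries and inodes per arc_prune pass
# (the default is 10000), so each pass reclaims a larger chunk of metadata.
echo 100000 > /sys/module/zfs/parameters/zfs_arc_meta_prune

# Sketch only: cap how many times the metadata eviction loop restarts per pass,
# to spend less time re-evicting records that are immediately re-added.
echo 512 > /sys/module/zfs/parameters/zfs_arc_meta_adjust_restarts
```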
Describe how to reproduce the problem
Create a blank pool and start creating datasets. Depending on the size of the system, the number of datasets needed to exhaust ARC metadata differs. On a machine with 8 GB of memory, Linux started to fail at around 5K datasets using the default parameters; FreeBSD was able to create over 80K datasets without issues.
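A throwaway test pool is enough here; for example, a file-backed vdev works (paths below are placeholders):

```bash
# The amount of data on the pool doesn't matter, so a small file-backed vdev
# is sufficient for reproducing the metadata growth.
truncate -s 10G /tmp/zfs-test.img
zpool create tank /tmp/zfs-test.img
```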
Requires bash:

```bash
#!/bin/bash
# Create datasets in a loop, logging progress to log.txt.
for x in {1..10000}
do
    echo "creating $x" | tee -a log.txt
    zfs create tank/$x
done
```
Check `arc_summary -s arc` for usage stats.
The easiest approach I found was to keep arc_summary and htop (with the zfs modules loaded) running while creating the datasets, i.e.:
`watch arc_summary -s arc` (Linux)
`gnu-watch arc_summary -s arc` (FreeBSD)
Include any warning/errors/backtraces from the system logs
To help clarify some of the issues, I've run some tests to gather data about the differences. The tests were run on two machines with 8 GB of RAM. The FreeBSD machine used the stock ZFS settings for FreeBSD 13.1. The Linux machine is running OpenZFS 2.1.6, not the version shipped by Canonical but the one from jonathonf's [repository](https://launchpad.net/~jonathonf/+archive/ubuntu/zfs). The only important differences from stock settings are the increased metadata and dnode limits. I had forgotten to turn them off before running the tests, but they don't affect much beyond increasing the number of datasets that can be created.

Custom settings for Linux:
```
#Increase amount of arc space dnode entries can use
options zfs zfs_arc_dnode_limit_percent=90
#Increase amount of arc space for metadata
options zfs zfs_arc_meta_limit_percent=90
#Decrease amount of space reserved for root operations
#Since we have 20TB of disks we don't need 645GB of reserved space; 156GB is fine.
options zfs spa_slop_shift=8
#Allow ZED to have a larger buffer for messages
options zfs zfs_zevent_len_max=50000
#Disable deferred resilver
options zfs zfs_resilver_disable_defer=1
```
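These options typically live in a modprobe config such as /etc/modprobe.d/zfs.conf and take effect when the zfs module is (re)loaded. As a sanity check, the effective values can be read back at runtime (Linux 2.1.x parameter names):

```bash
# Confirm the tunables actually took effect after the module loaded.
cat /sys/module/zfs/parameters/zfs_arc_dnode_limit_percent
cat /sys/module/zfs/parameters/zfs_arc_meta_limit_percent
cat /sys/module/zfs/parameters/spa_slop_shift
```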
FreeBSD stats to know:

```
Max size (high water):                  27:1    7.0 GiB
Metadata cache size (hard limit):       75.0 %  5.2 GiB
```

On FreeBSD the default metadata hard limit is 75% of the ARC's maximum size, which on FreeBSD defaults to total RAM - 1 GB.
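For context, a quick back-of-the-envelope check of those two numbers on the 8 GB FreeBSD box; the sysctl names are assumptions from memory and may vary between releases:

```bash
# ARC max on FreeBSD defaults to roughly RAM - 1 GiB:  8 GiB - 1 GiB = 7.0 GiB
# Metadata hard limit defaults to 75% of ARC max:      0.75 * 7 GiB  ≈ 5.2 GiB
# Effective values can be read back via sysctl (names may vary by version):
sysctl vfs.zfs.arc_max
sysctl vfs.zfs.arc.meta_limit_percent
```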
The first graph shows FreeBSD's metadata cache size while creating datasets:

As can be seen from the graph, once the metadata cache reaches around 115% capacity it is flushed and slowly fills back up as new datasets are created. I stopped the test at around 27K datasets as I had gotten enough data to show the trend.
Linux stats to know:

```
Max size (high water):                  16:1    3.9 GiB
Metadata cache size (hard limit):       90.0 %  3.5 GiB
```

Linux defaults to 50% of RAM for the ARC.
The metadata hard limit here is modified to be 90% instead of the normal 75%.
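The same check for the Linux box, reading the effective values from the ARC kstats (2.1.x kstat names):

```bash
# ARC max on Linux defaults to 50% of RAM:  0.5 * 8 GiB ≈ 3.9 GiB (as reported above)
# Metadata hard limit at 90% of ARC max:    0.9 * 3.9 GiB ≈ 3.5 GiB
# Effective values in bytes:
grep -E '^(c_max|arc_meta_limit) ' /proc/spl/kstat/zfs/arcstats
```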
The second graph shows Linux's metadata cache while creating datasets:

The graph abruptly ends at the ~5600 dataset mark, as ZFS was no longer able to create any new datasets while arc_prune and arc_evict were trying to clean up the metadata cache to get under the hard limit.
Linux arc_prune / arc_evict CPU usage

Combined graph

The raw arc_summary logs during each dataset's creation and SVG versions of the graphs can be found at https://github.com/manfromafar/linux-vs-freebsd-arc-stats
Datasets were created as fast as possible, using a simple for loop that would print the dataset number and a timestamp, create the dataset, then print arc_summary stats (a sketch of that loop is included after the links below).
Linux arc_summary -s arc log with datasets created, time, and arc report https://github.com/manfromafar/linux-vs-freebsd-arc-stats/blob/main/linux/linux_arc_stats-2022-11-30-10-39.txt
Freebsd arc_summary -s arc log with datasets created, time, and arc stats https://github.com/manfromafar/linux-vs-freebsd-arc-stats/blob/main/freebsd/freebsd_arc_stats-2022-11-30-11-52.txt
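For reference, a minimal sketch of that collection loop as described above (the pool name and log file are placeholders, not the exact script that produced the logs):

```bash
#!/bin/bash
# Log the dataset number and a timestamp, create the dataset,
# then append the ARC section of arc_summary after each creation.
LOG=arc_stats-$(date +%Y-%m-%d-%H-%M).txt
for x in {1..30000}
do
    echo "creating $x $(date +%Y-%m-%d)" | tee -a "$LOG"
    zfs create tank/$x
    arc_summary -s arc >> "$LOG"
done
```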
Have you been able to mitigate the issues under Linux? The general issues around arc_prune and arc_evict getting stuck using 100% CPU are severely impacting the usability of OpenZFS.
arc_evict and arc_prune logic was significantly rewritten in OpenZFS 2.2. It would be interesting to hear any feedback about the new behavior.
I think I've come to the conclusion that throwing 256 GB of RAM at this problem is easier than other options and will most likely resolve my issues. The underlying problem seems to be that I have an absolute ton of files/directories, and the pressure on the ARC for metadata and dnode caching is simply too much for the available RAM on the host.
2.2 by itself doesn't seem to be able to fix having insufficient memory, but it does appear to handle it better. In particular, arc_prune and arc_evict don't get stuck at 100% CPU.
Not sure, but maybe there is room for improvement in storing dnode and metadata records in RAM more efficiently? I'm still curious why metadata seems to be stored so much more space-efficiently on disk, when RAM is the much more precious resource. (See also "dir/file metadata consumes enormous amount of ARC", openzfs/zfs#13925.)
@amotin I just built the 2.2.2 binaries using Ubuntu 22.04 and tried again. The test at first appeared better: I got to 9.5K datasets before exhausting memory, which caused kswapd0 to consume CPU.
Unfortunately, although the limit is higher now on Linux, a limit is still reached. It looks like the limit of 9.5K datasets is directly tied to the ARC being able to consume more memory on 2.2.x. Unless 2.2.x changes the default of 50% of RAM on Linux, the ARC should be limited to ~4 GB, but by the time ZFS gives up the ARC reaches about ~6 GB of usage.
arc_summary output just before memory exhaustion:

```
creating 9510 2023-12-07
------------------------------------------------------------------------
ZFS Subsystem Report Thu Dec 07 2023
Linux 5.15.0-89-generic 2.2.2-1
Machine: ubuntushiftfstest (x86_64) 2.2.2-1
ARC status: HEALTHY
Memory throttle count: 0
ARC size (current): 163.7 % 6.3 GiB
Target size (adaptive): 6.2 % 248.0 MiB
Min size (hard limit): 6.2 % 248.0 MiB
Max size (high water): 16:1 3.9 GiB
Anonymous data size: 0.0 % 0 Bytes
Anonymous metadata size: 0.0 % 0 Bytes
MFU data target: 37.1 % 2.3 GiB
MFU data size: 0.0 % 0 Bytes
MFU ghost data size: 0 Bytes
MFU metadata target: 12.4 % 776.8 MiB
MFU metadata size: 41.2 % 2.5 GiB
MFU ghost metadata size: 187.9 MiB
MRU data target: 37.1 % 2.3 GiB
MRU data size: 0.0 % 0 Bytes
MRU ghost data size: 0 Bytes
MRU metadata target: 13.3 % 830.4 MiB
MRU metadata size: 58.8 % 3.6 GiB
MRU ghost metadata size: 157.2 MiB
Uncached data size: 0.0 % 0 Bytes
Uncached metadata size: 0.0 % 0 Bytes
Bonus size: 0.4 % 23.8 MiB
Dnode cache target: 90.0 % 3.5 GiB
Dnode cache size: 4.3 % 151.9 MiB
Dbuf size: 0.6 % 40.4 MiB
Header size: 0.4 % 25.9 MiB
L2 header size: 0.0 % 0 Bytes
ABD chunk waste size: 0.1 % 5.0 MiB
ARC hash breakdown:
Elements max: 103.7k
Elements current: 100.0 % 103.7k
Collisions: 31.8k
Chain max: 3
Chains: 4.8k
ARC misc:
Deleted: 13.6k
Mutex misses: 3
Eviction skips: 0
Eviction skips due to L2 writes: 0
L2 cached evictions: 0 Bytes
L2 eligible evictions: 3.7 GiB
L2 eligible MFU evictions: 6.3 % 234.0 MiB
L2 eligible MRU evictions: 93.7 % 3.4 GiB
L2 ineligible evictions: 0 Bytes
```
After rebooting and importing the pool, memory usage spikes back up to almost full, but the server is more responsive. If I run `zfs list`, arc_prune and arc_evict show back up consuming CPU, but at least the list does return, and once it finishes arc_prune and arc_evict stop consuming CPU.
The pool here is looking like this. I think the biggest thing was just to allow the dnode cache to get this large and to set primarycache=metadata. (Read performance on the pool is not a problem and it's all backed by SSDs anyway.)

```
Dnode cache target:                     50.0 %  4.0 GiB
Dnode cache size:                       77.6 %  3.1 GiB
```
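A minimal sketch of that mitigation as it might look on Linux; the pool name is a placeholder, and the 50% dnode limit is inferred from the target shown above rather than being the exact configuration used:

```bash
# Keep only metadata in the ARC for this pool; data reads hit the SSDs directly.
zfs set primarycache=metadata tank

# Let the dnode cache use a larger share of the ARC (the Linux default is 10%).
echo 50 > /sys/module/zfs/parameters/zfs_arc_dnode_limit_percent
```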