Several improvements to ARC shrinking
Motivation and Context
Since simultaneously updating to the Linux 6.6 kernel and increasing the maximum ARC size in TrueNAS SCALE 24.04, we've started to receive multiple complaints from people about excessive swapping making their systems unresponsive. While I attribute a significant part of the problem to the new Multi-Gen LRU code enabled in the 6.6 kernel (disabling it helps), I ended up with this set of smaller tunings on the ZFS side, trying to make it a bit nicer in this terrible environment.
Description
- When receiving a memory pressure signal from the OS, be stricter about actually freeing some memory. Otherwise the kernel may come back and request much more. Return as a result how much arc_c was actually reduced due to this request, which may be less than requested.
- On Linux, set arc_no_grow before waiting for reclaim, not after, or the ARC may grow back while we are waiting.
- On Linux, add a new zfs_arc_shrinker_seeks parameter to balance the cost of ARC eviction relative to the page cache and other subsystems (see the sketch after this list).
- Slightly update the Linux arc_set_sys_free() math for newer kernels.
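For context, a minimal sketch of how a tunable like zfs_arc_shrinker_seeks could feed the Linux shrinker's `seeks` field: the kernel scales the scan request roughly as `(freeable >> priority) * 4 / seeks` (DEFAULT_SEEKS is 2), so a larger value makes ARC eviction look more expensive relative to the page cache. The structure and registration below are illustrative; the actual integration in arc_os.c may differ.

```c
/*
 * Illustrative sketch only: wiring an ARC "seeks" tunable into the Linux
 * shrinker.  A larger seeks value makes the ARC look more expensive to
 * reclaim than the page cache (which uses DEFAULT_SEEKS).  The real
 * arc_os.c code and registration macros differ.
 */
static int zfs_arc_shrinker_seeks = DEFAULT_SEEKS;

static struct shrinker arc_shrinker_sketch = {
	.count_objects	= arc_shrinker_count,	/* existing ARC callbacks */
	.scan_objects	= arc_shrinker_scan,
	.seeks		= DEFAULT_SEEKS,	/* replaced with the tunable */
	.batch		= 0,
};

static void
arc_shrinker_init_sketch(void)
{
	/* Apply the module parameter before the shrinker is registered. */
	arc_shrinker_sketch.seeks = zfs_arc_shrinker_seeks;
	/* register_shrinker()/shrinker_register() as the kernel version requires */
}
```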
Types of changes
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [x] Performance enhancement (non-breaking change which improves efficiency)
- [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
- [ ] Documentation (a change to man pages or other documentation)
Checklist:
- [x] My code follows the OpenZFS code style requirements.
- [ ] I have updated the documentation accordingly.
- [ ] I have read the contributing document.
- [ ] I have added tests to cover my changes.
- [ ] I have run the ZFS Test Suite with this change applied.
- [x] All commit messages are properly formatted and contain `Signed-off-by`.
FWIW I think there's yet another possible source of excessive swapping in addition to your observations - it might be caused by a too high zfs_abd_scatter_max_order. In our setup, it only takes a few days until excessive reclaim kicks in, and then we have to add a zram-based swap device. When we lower zfs_abd_scatter_max_order to below 3, the excessive reclaim of course doesn't disappear completely, as there are other sources of pressure in the kernel asking for higher-order buddies, but the difference is very noticeable (load drops by 100 on a machine with 600 first-level containers and tons more nested).
In our situation, since we run with txg_timeout = 15 and a pretty high dirty_data_max, so that we really mostly sync on the 15s mark, it's those syncs that trigger a lot of paging out to swap. Using zram has so far mitigated it, as we tend to have at least 100G+ of free memory, but it's easily available only in 4k chunks...
@snajpa Yes, I was also thinking about zfs_abd_scatter_max_order. I don't have my own numbers, but my thinking was that on FreeBSD, where the ARC allocates only individual PAGE_SIZE pages, it takes the least convenient memory from the OS, while on Linux the ARC always allocates the best contiguous chunks it can, which leaves the other subsystems that are more sensitive to fragmentation to suffer. Contiguous chunks should be good for I/O efficiency, and on FreeBSD I do measure some per-page overheads, but there must be some sweet spot.
@amotin I haven't looked at the code yet, but if it doesn't do it already, it might be worth allocating the memory with flags so it doesn't trigger any reclaim at all and then decrement the requested order on failure.
We could also optimize further by saving the last successful order :) and only sometimes (whatever that means for now) go for a higher order.
> it might be worth allocating the memory with flags so it doesn't trigger any reclaim at all and then decrement the requested order on failure
That is what ZFS does. It tries to allocate big chunks first, but if that fails, it requests smaller and smaller ones until it gets enough. But that way it consumes all the remaining big chunks first.
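For illustration, a hedged sketch of that decreasing-order pattern, loosely modeled on abd_alloc_chunks() in module/os/linux/zfs/abd_os.c; the real code additionally builds the scatter/gather table, tracks statistics, and uses somewhat different flags.

```c
/*
 * Hedged sketch of the "try big, then fall back" allocation loop described
 * above.  Loosely modeled on abd_alloc_chunks(); the flags, accounting and
 * scatter/gather handling in the real code differ.
 */
static void
abd_alloc_chunks_sketch(size_t size, unsigned int max_order)
{
	unsigned int order = MIN(max_order, get_order(size));
	size_t remaining = size;

	while (remaining > 0) {
		struct page *page = alloc_pages_node(NUMA_NO_NODE,
		    GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN |
		    (order ? __GFP_COMP : 0), order);

		if (page == NULL) {
			if (order == 0)
				break;		/* real code handles ENOMEM */
			order--;		/* fall back to a smaller order */
			continue;
		}

		remaining -= MIN(remaining, (size_t)PAGE_SIZE << order);
		/* real code appends the pages to an sg_table here */
	}
}
```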
It actually seems to directly call kvmalloc() when HAVE_KVMALLOC. In the 6.8 source I'm looking at, kvmalloc seems to use __GFP_NORETRY, for which the documentation says it does one round of reclaim in this implementation. I'm tempted to change that line to kmalloc_flags &= ~__GFP_DIRECT_RECLAIM; to see what happens :D Not sure what to do (if anything) on the ZFS level with this information though.
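To make the idea concrete, here is a hedged sketch of the flag tweak being contemplated; the surrounding function is a simplification of how a kvmalloc-style wrapper commonly adjusts flags for the physically-contiguous attempt, not the exact OpenZFS spl_kvmalloc() code.

```c
/*
 * Hedged sketch of the experiment discussed above: strip
 * __GFP_DIRECT_RECLAIM from the flags used for the physically-contiguous
 * kmalloc attempt, so that attempt never triggers reclaim and the code
 * falls back to vmalloc() instead.  This is a simplification, not the
 * actual spl_kvmalloc() implementation.
 */
static void *
spl_kvmalloc_sketch(size_t size, gfp_t flags)
{
	gfp_t kmalloc_flags = flags;
	void *ptr;

	if (size > PAGE_SIZE) {
		kmalloc_flags |= __GFP_NOWARN | __GFP_NORETRY;
		/* The change being contemplated in the comment above: */
		kmalloc_flags &= ~__GFP_DIRECT_RECLAIM;
	}

	ptr = kmalloc(size, kmalloc_flags);
	if (ptr != NULL)
		return (ptr);

	/* fall back to virtually-contiguous memory */
	return (vmalloc(size));
}
```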
@snajpa Most of the ARC's capacity is allocated by abd_alloc_chunks() via alloc_pages_node().
I've tried bpftracing spl_kvmalloc calls and it seems at least dsl_dir_tempreserve_space and dmu_buf_hold_array_by_dnode are calling spl_kvmalloc (which ends up doing one round of reclaim). This is running on a comparatively pretty much idle staging node, yet it's IMHO way too many calls in too little time...
```
[[email protected]]
~ # timeout --foreground 10 bpftrace -e 'kprobe:spl_kvmalloc{ printf("%s: %s(%d)\n", probe, comm, pid); }' | wc -l
231947
```
Interestingly, it seems to always be called for pretty similar amounts of memory, ranging from 273408 to 273856 bytes (?)
zfs_arc_shrinker_limit=10000 (default) seems to strongly favor the ARC, forcing heavy swapping even when not needed. Adjusting vm.swappiness has only a limited effect (unless setting it to 0).
Does this patch address this issue? Can the fix be implemented with the existing zfs_arc_shrinker_limit, zfs_arc_shrink_shift and zfs_arc_pc_percent tunables, without introducing yet another tunable? It is becoming quite difficult to tune a system to avoid excessive swap.
Side question: in general, why is it so difficult to "emulate" the behavior of the Linux page cache with respect to growing, reclaiming and shrinking?
Thanks.
> zfs_arc_shrinker_limit=10000 (default) seems to strongly favor the ARC, forcing heavy swapping even when not needed. Adjusting vm.swappiness has only a limited effect (unless setting it to 0).
I do plan to set it to 0 in our TrueNAS builds, since we control the kernel there. But I have no good ideas what to do about upstream, since some Linux kernels tend to request enormous eviction amounts, even though the original motivation for its addition should no longer apply to most users. The 10000 default IMHO is extremely low, if any value other than 0 there is correct at all. But I am not touching it in this patch, leaving it for later.
> Does this patch address this issue? Can the fix be implemented with the existing zfs_arc_shrinker_limit, zfs_arc_shrink_shift and zfs_arc_pc_percent tunables, without introducing yet another tunable? It is becoming quite difficult to tune a system to avoid excessive swap.
This patch is not expected to fix the issue by itself, only to polish some rough edges. As I have said, at this point we have removed MGLRU from our kernels, which helped a lot with excessive swapping, and I am going to set zfs_arc_shrinker_limit=0 and zfs_arc_pc_percent=300 to make the ARC adjust better. The new tunable I've added is more for completeness; I do not insist on it and may remove it if there are objections.
> Side question: in general, why is it so difficult to "emulate" the behavior of the Linux page cache with respect to growing, reclaiming and shrinking?
Because the page cache does not use the crippled shrinker KPIs ZFS has to use. All memory pressure handling in Linux is built around the page cache, and everything else is secondary. And the mentioned MGLRU takes this to an extreme, which is why we had to disable it, but that is not a long-term solution.
As this patch touches zfs_arc_shrinker_limit, any thoughts regarding https://github.com/openzfs/zfs/pull/16313#issuecomment-2198551588 ? Do you feel comfortable leaving zfs_arc_shrinker_limit=10000? The default value seems too small to me.
> As this patch touches zfs_arc_shrinker_limit, any thoughts regarding #16313 (comment)? Do you feel comfortable leaving zfs_arc_shrinker_limit=10000? The default value seems too small to me.
This patch intentionally does nothing about zfs_arc_shrinker_limit, and for a reason. While I don't like the current default, I don't see a good alternative. If I were to change it, I would change it to 0 and then try to push Linux developers to be reasonable. As long as ZFS uses anything other than 0, it does not follow the kernel's memory pressure requests, and in that situation I see it as hopeless to try to make the kernel cooperate.
After more thinking I've decided to add one more chunk to this patch. When receiving direct reclaim from file systems (which may be ZFS itself), the previous code was just ignoring the request to avoid deadlocks. But if ZFS occupies most of the system's RAM, ignoring such requests may cause excessive pressure on other caches and swap, and in the longer run may result in OOM killer activation. Instead of ignoring the request, I've made it shrink the ARC and kick the eviction thread, but skip the wait. It may not be perfect, but do we have a better choice?
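A minimal sketch of the behavior described above, with illustrative names rather than the exact arc_os.c code; the point is only the split between reducing the target, waking the eviction thread, and conditionally waiting.

```c
/*
 * Hedged sketch, not the exact code in arc_os.c: on direct reclaim that
 * originates from a filesystem context we cannot safely wait for eviction
 * (deadlock risk), but we can still lower the ARC target and wake the
 * eviction thread.  Names and signatures here are illustrative.
 */
static unsigned long
arc_shrinker_scan_sketch(struct shrinker *shrink, struct shrink_control *sc)
{
	uint64_t asked = ptob(sc->nr_to_scan);
	/* Waiting is only safe when the caller allows filesystem recursion. */
	boolean_t can_wait = (sc->gfp_mask & __GFP_FS) != 0;

	/* Lower the ARC target; may reduce by less than was asked for. */
	uint64_t reduced = arc_reduce_target_size(asked);

	/* Always kick the eviction thread... */
	zthr_wakeup(arc_evict_zthr);

	/* ...but only block on it when it is safe to do so. */
	if (can_wait)
		arc_wait_for_eviction(asked, B_FALSE, B_FALSE);

	return (btop(reduced));
}
```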
I've decided to once more reconsider arc_is_overflowing(). Previously it made the caller never wait for eviction of less than 1/512 of the ARC size or SPA_MAXBLOCKSIZE (16MB), whichever is bigger. But Linux starts the reclaim process from 1/4096 (see DEF_PRIORITY of 12), which means that for the first several iterations ZFS may not react to memory pressure in a timely manner, forcing more eviction from the page cache and other caches, which may already be at a minimum if most of the memory is consumed by the ARC.
The new code uses zfs_max_recordsize as the minimum wait threshold under memory pressure, which is still 16MB on 64-bit platforms, but only 1MB on 32-bit, which should be nicer to the latter. Not considering zfs_arc_overflow_shift under pressure allows being more reactive on large systems, where 1/512 of the ARC may mean gigabytes of RAM, while the kernel may need much less, but needs it right now.
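Roughly, the distinction can be expressed as below; this is a simplified sketch of the threshold logic just described, not the exact arc_is_overflowing() code, though the globals referenced are the usual ARC ones.

```c
/*
 * Simplified sketch of the threshold logic described above, not the exact
 * arc_is_overflowing() implementation.  Under memory pressure we only
 * require the overflow to exceed zfs_max_recordsize before waiting;
 * otherwise the larger arc_c >> zfs_arc_overflow_shift slack also applies.
 */
static boolean_t
arc_is_overflowing_sketch(boolean_t under_pressure)
{
	uint64_t size = aggsum_lower_bound(&arc_sums.arcstat_size);
	uint64_t slack = zfs_max_recordsize;

	if (!under_pressure)
		slack = MAX(slack, arc_c >> zfs_arc_overflow_shift);

	return (size > arc_c + slack);
}
```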
PS: Thinking about it more, with the current zfs_arc_shrinker_limit default of 10000 pages (which means only 40MB of memory reclaimed at a time under absolutely desperate pressure before the OOM killer), I suppose ZFS could almost never react to memory pressure in a timely manner on large systems. This should make it somewhat better, while zfs_arc_shrinker_limit is still evil.
BTW, some workarounds growing from my MGLRU complaints: https://lkml.kernel.org/r/[email protected] .
> PS: Thinking about it more, with the current zfs_arc_shrinker_limit default of 10000 pages (which means only 40MB of memory reclaimed at a time under absolutely desperate pressure before the OOM killer), I suppose ZFS could almost never react to memory pressure in a timely manner on large systems. This should make it somewhat better, while zfs_arc_shrinker_limit is still evil.
I agree. It should actually be ~160 MB (40 MB * 4 sublists), but the result does not change: under memory pressure, the ARC forces heavy swap and/or OOM. A more reasonable default for zfs_arc_shrinker_limit would be in the range of 128K pages, with no limit at all when direct reclaim is requested.
I don't see any major issues. Could you please rebase on master to get a good FreeBSD test run?