ARC memory overhead not accounted for
System information
| Type | Version/Name |
|---|---|
| Distribution Name | Rocky Linux |
| Distribution Version | 9.4 |
| Kernel Version | 5.14.0-427.22.1.el9_4.x86_64 |
| Architecture | x86_64 |
| OpenZFS Version | 2.2.6-1 |
Describe the problem you're observing
When stat-ing many files, available memory shrinks much faster than the corresponding ARC growth. For example:
# create test dataset with 100k files
zfs destroy tank/fsmark; zfs create tank/fsmark -o compression=lz4 -o xattr=off
fs_mark -k -S0 -D10 -N1000 -n 100000 -d /tank/fsmark/
# reset ARC via export/import
zpool export tank; zpool import tank
# get initial ARC statistics via "arcstat 1"
time read ddread ddh% dmread dmh% pread ph% size c avail
15:33:37 0 0 0 0 0 0 0 5.7M 1.7G 3.1G
# use find to read inode metadata (i.e. stat)
find /tank/fsmark/ -ctime -1 | wc -l
# when done, check ARC statistics again - notice how ARC increased by about 700M but avail decreased by 1.3G
time read ddread ddh% dmread dmh% pread ph% size c avail
15:33:47 416K 0 0 413K 99 2.4K 13 706M 1.7G 1.8G
# arc_summary shows the ARC at ~700M, with no accounting for the remaining 600M (1.3G - 700M), which seems "lost"
ARC size (current): 38.6 % 707.2 MiB
# slabtop --sort=c shows the following
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
300048 300048 100% 1.13K 21432 14 342912K zfs_znode_cache
300216 300216 100% 0.96K 37527 8 300216K dnode_t
10296 10296 100% 16.00K 5148 2 164736K zio_buf_comb_16384
302992 302981 99% 0.50K 37874 8 151496K kmalloc-512
310540 310540 100% 0.38K 31054 10 124216K dmu_buf_impl_t
300045 300045 100% 0.26K 20003 15 80012K sa_cache
9432 9432 100% 8.00K 2358 4 75456K kmalloc-8k
124026 124026 100% 0.19K 5906 21 23624K dentry
324160 324160 100% 0.06K 5065 64 20260K lsm_inode_cache
20364 19919 97% 0.65K 1697 12 13576K inode_cache
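For what it's worth, the "lost" portion can be narrowed down by totaling the ZFS-related slab caches from /proc/slabinfo and putting the sum next to the ARC size. A rough sketch (needs root; the cache-name list is just the set seen in the slabtop output above):

# total the ZFS-related slab caches (num_objs * objsize) and compare with arcstats "size"
awk '$1 ~ /^(zfs_znode_cache|dnode_t|dmu_buf_impl_t|sa_cache|zio_buf_comb)/ { kb += $3 * $4 / 1024 }
     END { printf "zfs-related slabs: %d KiB\n", kb }' /proc/slabinfo
awk '$1 == "size" { printf "ARC size:          %d KiB\n", $3 / 1024 }' /proc/spl/kstat/zfs/arcstats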
Describe how to reproduce the problem
stat many files and observe how available memory shrinks much faster than the ARC grows; a minimal script is sketched below.
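For a self-contained run, something along these lines (a sketch only; it assumes the tank pool and the fs_mark invocation from the example above) records MemAvailable and the ARC size around the find and prints the two deltas:

#!/bin/bash
# sketch: compare ARC growth against the drop in MemAvailable around a metadata scan
zfs destroy tank/fsmark 2>/dev/null
zfs create -o compression=lz4 -o xattr=off tank/fsmark
fs_mark -k -S0 -D10 -N1000 -n 100000 -d /tank/fsmark/
zpool export tank && zpool import tank        # reset the ARC

arc_before=$(awk '$1 == "size" { print $3 }' /proc/spl/kstat/zfs/arcstats)   # bytes
avail_before=$(awk '/^MemAvailable/ { print $2 }' /proc/meminfo)             # KiB

find /tank/fsmark/ -ctime -1 | wc -l          # stat every file once

arc_after=$(awk '$1 == "size" { print $3 }' /proc/spl/kstat/zfs/arcstats)
avail_after=$(awk '/^MemAvailable/ { print $2 }' /proc/meminfo)

echo "ARC grew by:          $(( (arc_after - arc_before) / 1024 / 1024 )) MiB"
echo "MemAvailable dropped: $(( (avail_before - avail_after) / 1024 )) MiB"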
Include any warning/errors/backtraces from the system logs
None.
I am not sure this is really a problem of ZFS, at least not of ZFS alone. Each time you stat a new file, several structures are allocated: dnode, SA, dentry, etc. ZFS already accounts dnodes (which are the biggest consumer) and their backing dbufs as part of the ARC; please see the arc_summary output in 2.3, which I recently updated to show it. I am not sure whether SAs are accounted to the ARC; that may indeed need some thought. Dentries and some other structures are allocated by the Linux kernel and are outside of ZFS's scope, so even if everything else were perfect, available memory would still shrink.
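For the accounting question, the relevant arcstats counters can be put next to the slab caches they should roughly cover. A quick sketch (the mapping is approximate, e.g. sa_cache holds the SA handles rather than the bonus buffers themselves, and /proc/slabinfo needs root):

# ARC-side accounting of per-file metadata
awk '$1 ~ /^(dnode_size|bonus_size|dbuf_size|metadata_size)$/ { printf "%-16s %8d KiB\n", $1, $3 / 1024 }' /proc/spl/kstat/zfs/arcstats
# slab-side consumers (num_objs * objsize)
awk '$1 ~ /^(dnode_t|sa_cache|dmu_buf_impl_t|zfs_znode_cache)$/ { printf "%-16s %8d KiB\n", $1, $3 * $4 / 1024 }' /proc/slabinfo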
I really think it is ZFS related, as doing the same on an XFS mountpoint shows much less overhead. The difference seems massive. Example:
# create 100k files on the root XFS filesystem and drop caches
fs_mark -k -S0 -D10 -N1000 -n 100000 -d /opt/fsmark/
sync; echo 3 > /proc/sys/vm/drop_caches
# get mem/cache stats
free -m
total used free shared buff/cache available
Mem: 3659 348 3402 5 65 3311
Swap: 2083 6 2077
# stat files via find
find /opt/fsmark/ -ctime -1 | wc -l
# show mem/cache stats, notice how little additional memory is used (~120M)
free -m
total used free shared buff/cache available
Mem: 3659 464 3219 5 199 3195
Swap: 2083 6 2077
# slabtop --sort=c shows the following
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
100410 100410 100% 1.06K 6694 15 107104K xfs_inode
121128 121128 100% 0.19K 5768 21 23072K dentry
20328 19379 95% 0.65K 1694 12 13552K inode_cache
121088 121088 100% 0.06K 1892 64 7568K lsm_inode_cache
113472 113456 99% 0.06K 1773 64 7092K kmalloc-64
100224 100224 100% 0.06K 1566 64 6264K kmalloc-rcl-64
28384 28354 99% 0.12K 887 32 3548K kernfs_node_cache
44160 44160 100% 0.06K 690 64 2760K ebitmap_node
536 481 89% 4.00K 67 8 2144K kmalloc-4k
90780 90780 100% 0.02K 534 170 2136K avtab_node
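To make the "much less overhead" claim quantitative on a live system, the filesystem-specific caches can be totaled per side. A small sketch (cache-name patterns taken from the two slabtop listings; shared kmalloc-* caches are deliberately left out because their ownership is ambiguous):

# per-filesystem slab footprint in KiB (needs root)
awk '$1 ~ /^(zfs_znode_cache|dnode_t|dmu_buf_impl_t|sa_cache|zio_buf_comb)/ { zfs += $3 * $4 }
     $1 ~ /^xfs_inode/ { xfs += $3 * $4 }
     END { printf "zfs slabs: %d KiB\nxfs slabs: %d KiB\n", zfs / 1024, xfs / 1024 }' /proc/slabinfo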
Well, I just noticed a surprising behavior: xattr=off behaves the same as xattr=on (the directory-based implementation), so the tests I reported in the first post really apply to xattr=on as well. Is that expected behavior? I thought xattr=off (with the corresponding noxattr mount option) would be the highest-performing mode.
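In case it helps, the effective setting can be double-checked on the dataset and in the mount options before repeating the test; switching to xattr=sa does not rewrite existing xattrs, so recreating the dataset (as in the runs above) gives a cleaner comparison. A sketch:

# confirm what is actually in effect
zfs get -o name,property,value xattr,compression tank/fsmark
grep /tank/fsmark /proc/mounts                # the noxattr option should show up here when xattr=off is in effect
# switch to SA-based xattrs and recreate the test files
zfs destroy tank/fsmark
zfs create -o compression=lz4 -o xattr=sa tank/fsmark
fs_mark -k -S0 -D10 -N1000 -n 100000 -d /tank/fsmark/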
With xattr=sa, ZFS shows much better results: stat-ing the same 100k files leaves the ARC at ~250M with a decrease in available memory of ~400M. More details:
cat /proc/spl/kstat/zfs/arcstats | grep size | grep -v l2
size 4 257063224
compressed_size 4 22286848
uncompressed_size 4 73855488
overhead_size 4 62395904
hdr_size 4 1105680
data_size 4 512
metadata_size 4 84682240
dbuf_size 4 40656672
dnode_size 4 98568264
bonus_size 4 32038080
anon_size 4 0
mru_size 4 82289152
mru_ghost_size 4 0
mfu_size 4 2393600
mfu_ghost_size 4 0
uncached_size 4 0
arc_raw_size 4 0
abd_chunk_waste_size 4 11776
slabtop --sort=c
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
100030 100030 100% 1.13K 7145 14 114320K zfs_znode_cache
100176 100176 100% 0.96K 12522 8 100176K dnode_t
3582 3567 99% 16.00K 1791 2 57312K zio_buf_comb_16384
102896 102880 99% 0.50K 12862 8 51448K kmalloc-512
103730 103730 100% 0.38K 10373 10 41492K dmu_buf_impl_t
100020 100020 100% 0.26K 6668 15 26672K sa_cache
3180 3180 100% 8.00K 795 4 25440K kmalloc-8k
201664 201664 100% 0.12K 6302 32 25208K kmalloc-128
121590 121590 100% 0.19K 5790 21 23160K dentry
20268 19490 96% 0.65K 1689 12 13512K inode_cache
Comparing the slabtop output between ZFS and XFS, dnode_t is very similar in size to xfs_inode. On the ZFS side, I see zfs_znode_cache consuming 100M on its own, and then there are the various buffers (zio_buf_comb, dmu_buf_impl_t), but those are expected.
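For a rough per-file figure from the numbers above (the shared kmalloc-512/kmalloc-128 caches are left out because their ZFS share is unclear):

# per-file slab footprint from the slabtop listings above (100k files each)
awk 'BEGIN {
    zfs = 114320 + 100176 + 41492 + 26672   # zfs_znode_cache + dnode_t + dmu_buf_impl_t + sa_cache, KiB (xattr=sa run)
    xfs = 107104                            # xfs_inode, KiB
    printf "zfs: ~%.1f KiB/file   xfs: ~%.1f KiB/file\n", zfs / 100000, xfs / 100000
}'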