ARC size "exploding" beyond set limit upon heavy NFS traffic, triggering OOM and killing system
System information
Type | Version/Name |
---|---|
Distribution Name | Debian |
Distribution Version | 10.10 |
Linux Kernel | 4.19.0-17-amd64 #1 SMP Debian 4.19.194-1 (2021-06-10) |
Architecture | amd64 |
ZFS Version | zfs-2.0.3-8~bpo10+1 |
SPL Version | 2.0.3-8~bpo10+1 |
Describe the problem you're observing
On several of our NFS servers, we see the ARC size exceeding `zfs_arc_max` and growing until OOM kills the system. Sometimes this takes days, sometimes less than an hour, and sometimes it seems to regulate itself.
Initially, we set `zfs_arc_max` to 32GByte (`echo $((2**35)) | tee /sys/module/zfs/parameters/zfs_arc_max`) and `arcstat` confirms the size to be 32G (out of 192GByte RAM).
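For reference, a minimal sketch (Python, needs root; assuming the standard paths under `/sys/module/zfs` and `/proc/spl/kstat/zfs`) of setting the limit at runtime and double-checking that it is reflected in the `c_max` kstat:

```python
#!/usr/bin/env python3
# Minimal sketch: set zfs_arc_max at runtime and verify the effective
# ARC ceiling (c_max) picked up the new value. Requires root.
ARC_MAX = 2**35  # 32 GiB

with open("/sys/module/zfs/parameters/zfs_arc_max", "w") as f:
    f.write(str(ARC_MAX))

# c_max in arcstats reflects the effective ARC ceiling.
with open("/proc/spl/kstat/zfs/arcstats") as f:
    for line in f:
        fields = line.split()
        if fields and fields[0] == "c_max":
            print("c_max =", int(fields[2]), "bytes")
```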
After prolonged, inefficient random reads by 1000s of NFS clients at about 400MByte/s, leading to an average of 740MByte/s read from the pool's 8 vdevs (3 disks each, in raidz configuration):
```
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
srv         82.1T  47.9T  3.39K    115   745M  1.02M
srv         82.1T  47.9T  3.60K    113   741M   750K
srv         82.1T  47.9T  3.50K    138   708M   886K
```
Initially:
```
arcstat 5
time read miss miss% dmis dm% pmis pm% mmis mm% size c avail
11:26:46 33 4 11 0 0 3 73 0 0 32G 32G 144G
11:26:51 30K 6.3K 20 632 2 5.7K 84 50 0 32G 32G 143G
11:26:56 26K 5.4K 20 467 2 4.9K 83 51 0 31G 32G 143G
11:27:01 31K 6.3K 20 595 2 5.7K 84 50 0 32G 32G 143G
11:27:06 31K 6.6K 20 592 2 6.0K 85 83 0 31G 32G 142G
11:27:11 29K 6.2K 20 569 2 5.6K 85 58 0 32G 32G 142G
11:27:16 28K 6.1K 21 585 2 5.5K 85 49 0 32G 32G 142G
```
After some time:
```
time read miss miss% dmis dm% pmis pm% mmis mm% size c avail
11:39:27 30K 6.0K 19 602 2 5.3K 82 108 0 113G 32G 867M
11:39:32 31K 5.9K 18 529 2 5.3K 80 127 0 114G 32G 89M
11:39:37 31K 5.9K 18 565 2 5.3K 81 116 0 114G 29G -653M
11:39:42 32K 6.1K 18 595 2 5.5K 80 107 0 114G 23G -1.1G
11:39:47 32K 6.1K 18 569 2 5.5K 81 110 0 114G 16G -1.8G
11:39:52 31K 6.0K 18 635 2 5.3K 81 113 0 115G 5.9G -2.2G
11:39:57 31K 5.8K 18 617 2 5.2K 81 119 0 115G 5.9G -3.2G
11:40:02 30K 5.4K 17 633 2 4.8K 79 117 0 116G 5.9G -4.0G
11:40:07 31K 5.7K 18 666 2 5.0K 79 121 0 116G 5.9G -4.8G
11:40:12 30K 5.5K 17 642 2 4.8K 79 125 0 117G 5.9G -5.4G
11:42:11 37K 5.7K 15 612 2 5.1K 58 441 1 117G 5.9G -5.5G
```
(Adding the `waste` column does not really yield much; it always stays below the 100kByte level.)
Describe how to reproduce the problem
Hit an NFS (v3) server hard with 100s to 1000s of clients reading data off the disks (presumably in small random bits).
Include any warning/errors/backtraces from the system logs
As the system dies with the OOM killer killing a "random" process (e.g. `agetty`), not much output has been found so far.
`echo 1 > /proc/sys/vm/drop_caches` does nothing, but `echo 2 > /proc/sys/vm/drop_caches` does cause a brief, noticeable effect:
```
13:03:56 41K 5.7K 13 181 0 5.5K 69 45 0 70G 31G 69G 22K
13:04:11 40K 5.5K 13 167 0 5.3K 68 157 0 71G 31G 68G 50K
13:04:26 4.3M 1.5K 0 46 0 1.4K 67 33 0 70G 31G 69G 63K
13:04:41 5.7M 138 0 2 0 136 72 5 0 69G 32G 70G 61K
time read miss miss% dmis dm% pmis pm% mmis mm% size c avail waste
13:04:56 5.8M 195 0 5 0 190 73 8 0 70G 32G 69G 59K
13:05:11 5.1M 1.0K 0 23 0 1.0K 70 48 0 68G 30G 71G 50K
13:05:26 42K 5.9K 13 161 0 5.7K 69 228 0 68G 29G 70G 33K
13:05:41 44K 6.0K 13 184 0 5.8K 66 142 0 69G 29G 70G 26K
```
But later on, while it is still running/blocking, the situation grows more and more dire:
```
time read miss miss% dmis dm% pmis pm% mmis mm% size c avail waste
13:16:27 40K 6.0K 14 302 0 5.6K 68 182 0 74G 10.0G 62G 44K
13:16:42 40K 6.1K 15 318 0 5.8K 70 215 0 75G 9.5G 60G 27K
13:16:57 39K 6.0K 15 317 1 5.7K 70 213 0 76G 9.0G 59G 23K
13:17:12 40K 6.0K 14 331 1 5.7K 69 206 0 77G 8.6G 57G 24K
13:17:27 42K 5.9K 13 334 0 5.6K 71 183 0 78G 8.3G 55G 21K
```
I am still investigating, but so far the only "cure" found was restarting the NFS server, which seems to let a lot of cache space "expire".
Thus, I would be happy for any hint as to what I could do to mitigate or get around this problem.
If you look at the arcstats, what is the size of the 'other' portion of the ARC?
It may be that you have a very large amount of dirty data: because your demand read rate is so high, the writes never get a chance to flush, and dirty data keeps building up. The write throttle should eventually slow the incoming rate of writes to resolve this, but it is something to look at.
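A minimal sketch (assuming the standard `/proc/spl/kstat/zfs/arcstats` layout, and treating 'other' as the bonus + dnode + dbuf buckets, which is roughly what `arc_summary` groups under that label) for pulling those numbers out:

```python
#!/usr/bin/env python3
# Sketch: break the current ARC size into a few buckets, including the
# "other" bucket (bonus + dnode + dbuf) and anon_size (dirty/in-flight data).
def read_arcstats(path="/proc/spl/kstat/zfs/arcstats"):
    stats = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            # Data lines look like: "<name> <type> <value>"
            if len(fields) == 3 and fields[1].isdigit():
                stats[fields[0]] = int(fields[2])
    return stats

s = read_arcstats()
gib = 2**30
other = s["bonus_size"] + s["dnode_size"] + s["dbuf_size"]
print(f"size      : {s['size'] / gib:6.1f} GiB (target c = {s['c'] / gib:.1f} GiB)")
print(f"anon_size : {s['anon_size'] / gib:6.1f} GiB")
print(f"other     : {other / gib:6.1f} GiB")
```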
With ZFS+NFS I have always had to limit ARC to far below the actual system memory, to account for ZFS taking up RAM from unknown slab allocations which it never releases.
Possibly related: #10302
> If you look at the arcstats, what is the size of the 'other' portion of the ARC?
Does 'other' refer to another "column" from the `arcstat` command or to entries from `/proc/spl/kstat/zfs/arcstats`?
I am currently creating test data on an otherwise unused NFS box to perform some tests there without interfering with other users' jobs. This will take a bit to set up, but hopefully I will be able to gather some data from `/proc/spl/kstat/zfs/arcstats`, `/proc/spl/kmem/slab` and `/proc/slabinfo`. Any other metrics I should try to gather?
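For the collection itself, a minimal sketch that periodically snapshots those three files into timestamped copies (the output directory and interval are arbitrary placeholders, not anything ZFS-specific):

```python
#!/usr/bin/env python3
# Sketch: snapshot the three files every INTERVAL seconds into timestamped
# copies under OUTDIR, so the state around an OOM can be reconstructed later.
import shutil, time
from pathlib import Path

SOURCES = [
    "/proc/spl/kstat/zfs/arcstats",
    "/proc/spl/kmem/slab",
    "/proc/slabinfo",        # usually needs root to read
]
OUTDIR = Path("/var/tmp/zfs-debug")   # assumed location, adjust as needed
INTERVAL = 60

OUTDIR.mkdir(parents=True, exist_ok=True)
while True:
    stamp = time.strftime("%Y%m%d-%H%M%S")
    for src in SOURCES:
        dst = OUTDIR / f"{stamp}{src.replace('/', '_')}"
        try:
            shutil.copyfile(src, dst)
        except OSError as e:
            print(f"could not copy {src}: {e}")
    time.sleep(INTERVAL)
```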
> Possibly related: #10302
Yeah, could be. As `waste` is always very small (ashift 12, default 128k recordsize, compression enabled but it does not look to be too effective, as the ratio currently is 1.00 for that user), I doubt we hit what PR #10701 addressed. Let's see if I can generate enough read-cache pressure to "reliably" generate a test case.
The stat I was looking for was `anon_size` from `/proc/spl/kstat/zfs/arcstats`, which will be the amount of dirty data. It will change as writes come in and are flushed out.
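A small sketch (same arcstats layout as above) of a watch loop that prints `anon_size` next to `size` and `c` every few seconds, to see the dirty data fluctuate between flushes:

```python
#!/usr/bin/env python3
# Sketch: poll anon_size from arcstats to watch how much dirty data is
# accumulating and being flushed over time.
import time

def arcstat(name, path="/proc/spl/kstat/zfs/arcstats"):
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) == 3 and fields[0] == name:
                return int(fields[2])
    return None

gib = 2**30
while True:
    print(f"anon_size={arcstat('anon_size') / gib:.2f}G "
          f"size={arcstat('size') / gib:.2f}G c={arcstat('c') / gib:.2f}G")
    time.sleep(5)
```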
So far, all my attempts to reproduce the problem have failed. I know the user had a directory with about 2 TByte of tarballs on his NFS share, and each tarball contained a few 1000 entries of either small uncompressed text files or much larger gzipped files, i.e. the individual files were compressed, but the outer `tar` containers were not!
Each client only accessed a single tarball, opened it, scanned it for all members and selected all of these based on the file names. Then those file names were sorted and each file was read in turn from the tarball, uncompressed to the local file system and then merged into a HDF5 file.
As the files within the tarball are not sorted by name, sorting the file names would obviously lead to random access of the tarball, but given that all of this happened within a Python script and Python's `tarfile` module seems to keep track of the positions where each tar member starts, that should not really be too harsh on the file system.
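Roughly, each client does something like the following (a sketch of my understanding of the workload; the tarball path and the final HDF5 merge are placeholders, not the user's actual code):

```python
#!/usr/bin/env python3
# Rough sketch of the per-client access pattern described above.
import tarfile
import tempfile

TARBALL = "/nfs/share/user/data-0001.tar"   # hypothetical path

with tarfile.open(TARBALL, "r") as tar:
    # Scanning for members walks the whole archive once and records offsets.
    members = [m for m in tar.getmembers() if m.isfile()]
    # Sorting by name decouples the read order from the on-disk member order,
    # so subsequent reads seek around the tarball instead of streaming it.
    members.sort(key=lambda m: m.name)
    with tempfile.TemporaryDirectory() as workdir:
        for m in members:
            tar.extract(m, path=workdir)
            # ... uncompress / merge the extracted file into an HDF5 file ...
```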
But so far, I have been unable to generate enough "pressure" or whatever is needed to make the ARC cache barrier break down on my partial copy of the user's data. The user's job was safe with 25 clients but caused problems with 250 clients running simultaneously. So far I have tried it with up to 800 clients, but the system just maxes out at about 60k reads and 15% cache misses:
```
13:17:55 69K 10K 15 532 0 9.9K 76 19 0 14G 14G 10G
13:18:00 65K 10K 15 618 1 9.4K 77 14 0 14G 14G 10G
13:18:05 66K 10K 15 568 1 9.7K 77 19 0 14G 14G 10G
time read miss miss% dmis dm% pmis pm% mmis mm% size c avail
13:18:10 66K 10K 15 548 1 9.5K 75 10 0 14G 14G 10G
13:18:15 64K 10.0K 15 613 1 9.4K 77 17 0 14G 14G 10G
13:18:20 64K 10K 15 596 1 9.6K 77 13 0 14G 14G 10G
13:18:25 62K 9.9K 15 643 1 9.3K 76 9 0 14G 14G 10G
```
(Please note that for this test I reduced the available RAM via the kernel command line, as I hoped to be able to trigger the problem more easily with less RAM.)
`dstat` shows the system being quite busy, but not exceedingly so:
```
----system---- --total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system-- ---load-avg--- ------memory-usage----- ----swap---
time |usr sys idl wai stl| read writ| recv send| in out | int csw | 1m 5m 15m | used free buff cach| used free
19-07 13:18:06| 4 20 0 76 0|1249M 381k|3916k 1961M| 0 0 | 67k 76k| 297 124 166|17.2G 11.6G 2132k 426M| 0 0
19-07 13:18:16| 4 19 0 76 0|1270M 363k|3911k 1958M| 0 0 | 67k 77k| 311 133 168|17.2G 11.6G 2132k 426M| 0 0
19-07 13:18:26| 4 19 0 77 0|1249M 371k|3871k 1915M| 0 0 | 66k 76k| 322 141 170|17.1G 11.8G 2132k 426M| 0 0
19-07 13:18:36| 4 19 0 77 0|1225M 375k|3795k 1858M| 0 0 | 65k 74k| 332 149 173|17.0G 11.8G 2132k 426M| 0 0
```
Any ideas on what kind of scenario I should/could try?
If you could print out `/proc/spl/kstat/zfs/arcstats` when it happens, that would be great for debugging.
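In case it helps with catching the moment, a small sketch that watches the ARC and dumps the full arcstats file once `size` overshoots the target `c` by more than 25% (the threshold and dump location are arbitrary placeholders):

```python
#!/usr/bin/env python3
# Sketch: watch the ARC and dump the full arcstats file once size has
# overshot the target (c) by more than 25%, before the OOM killer hits.
import time

ARCSTATS = "/proc/spl/kstat/zfs/arcstats"

def snapshot():
    with open(ARCSTATS) as f:
        return f.read()

while True:
    stats = {}
    for line in snapshot().splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[1].isdigit():
            stats[fields[0]] = int(fields[2])
    if stats["size"] > stats["c"] * 1.25:
        fname = time.strftime("/var/tmp/arcstats-%Y%m%d-%H%M%S")
        with open(fname, "w") as out:
            out.write(snapshot())
        print("ARC overshoot, dumped to", fname)
        time.sleep(300)   # avoid dumping on every iteration
    else:
        time.sleep(10)
```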
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.