ARC size "exploding" beyond set limit upon heavy NFS traffic, triggering OOM and killing system
System information
Type | Version/Name |
---|---|
Distribution Name | Debian |
Distribution Version | 10.10 |
Linux Kernel | 4.19.0-17-amd64 #1 SMP Debian 4.19.194-1 (2021-06-10) |
Architecture | amd64 |
ZFS Version | zfs-2.0.3-8~bpo10+1 |
SPL Version | 2.0.3-8~bpo10+1 |
Describe the problem you're observing
On several of our NFS servers, we see the ARC size exceeding `zfs_arc_max` and growing until OOM kills the system. Sometimes this takes days, sometimes less than an hour, and sometimes it seems to regulate itself.
Initially, we set `zfs_arc_max` to 32GByte (`echo $((2**35)) | tee /sys/module/zfs/parameters/zfs_arc_max`) and `arcstat` confirms the size to be 32G (out of 192GByte RAM).
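For reference, a minimal sketch (Python, needs root; assuming the standard paths under `/sys/module/zfs` and `/proc/spl/kstat/zfs`) of setting the limit at runtime and double-checking that it is reflected in the `c_max` kstat:

```python
#!/usr/bin/env python3
# Minimal sketch: set zfs_arc_max at runtime and verify the effective
# ARC ceiling (c_max) picked up the new value. Requires root.
ARC_MAX = 2**35  # 32 GiB

with open("/sys/module/zfs/parameters/zfs_arc_max", "w") as f:
    f.write(str(ARC_MAX))

# c_max in arcstats reflects the effective ARC ceiling.
with open("/proc/spl/kstat/zfs/arcstats") as f:
    for line in f:
        fields = line.split()
        if fields and fields[0] == "c_max":
            print("c_max =", int(fields[2]), "bytes")
```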
After prolonged, inefficient random reads by 1000s of NFS clients at about 400MByte/s, leading to an average of 740MByte/s read from the pool's 8 vdevs (3 disks each, in raidz configuration):
```
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
srv         82.1T  47.9T  3.39K    115   745M  1.02M
srv         82.1T  47.9T  3.60K    113   741M   750K
srv         82.1T  47.9T  3.50K    138   708M   886K
```
Initially:
```
arcstat 5
time read miss miss% dmis dm% pmis pm% mmis mm% size c avail
11:26:46 33 4 11 0 0 3 73 0 0 32G 32G 144G
11:26:51 30K 6.3K 20 632 2 5.7K 84 50 0 32G 32G 143G
11:26:56 26K 5.4K 20 467 2 4.9K 83 51 0 31G 32G 143G
11:27:01 31K 6.3K 20 595 2 5.7K 84 50 0 32G 32G 143G
11:27:06 31K 6.6K 20 592 2 6.0K 85 83 0 31G 32G 142G
11:27:11 29K 6.2K 20 569 2 5.6K 85 58 0 32G 32G 142G
11:27:16 28K 6.1K 21 585 2 5.5K 85 49 0 32G 32G 142G
```
After some time:
```
time read miss miss% dmis dm% pmis pm% mmis mm% size c avail
11:39:27 30K 6.0K 19 602 2 5.3K 82 108 0 113G 32G 867M
11:39:32 31K 5.9K 18 529 2 5.3K 80 127 0 114G 32G 89M
11:39:37 31K 5.9K 18 565 2 5.3K 81 116 0 114G 29G -653M
11:39:42 32K 6.1K 18 595 2 5.5K 80 107 0 114G 23G -1.1G
11:39:47 32K 6.1K 18 569 2 5.5K 81 110 0 114G 16G -1.8G
11:39:52 31K 6.0K 18 635 2 5.3K 81 113 0 115G 5.9G -2.2G
11:39:57 31K 5.8K 18 617 2 5.2K 81 119 0 115G 5.9G -3.2G
11:40:02 30K 5.4K 17 633 2 4.8K 79 117 0 116G 5.9G -4.0G
11:40:07 31K 5.7K 18 666 2 5.0K 79 121 0 116G 5.9G -4.8G
11:40:12 30K 5.5K 17 642 2 4.8K 79 125 0 117G 5.9G -5.4G
11:42:11 37K 5.7K 15 612 2 5.1K 58 441 1 117G 5.9G -5.5G
```
(Adding the `waste` column does not really yield much; it always stays below the 100kByte level.)
Describe how to reproduce the problem
Hit an NFS (v3) server hard with 100s to 1000s of clients reading data off the disks (presumably in small random bits).
Include any warning/errors/backtraces from the system logs
As the system dies with the OOM killer killing a "random" process (e.g. `agetty`), not much output has been found so far.
`echo 1 > /proc/sys/vm/drop_caches` does nothing, but `echo 2 > /proc/sys/vm/drop_caches` does cause a brief, noticeable effect:
```
13:03:56 41K 5.7K 13 181 0 5.5K 69 45 0 70G 31G 69G 22K
13:04:11 40K 5.5K 13 167 0 5.3K 68 157 0 71G 31G 68G 50K
13:04:26 4.3M 1.5K 0 46 0 1.4K 67 33 0 70G 31G 69G 63K
13:04:41 5.7M 138 0 2 0 136 72 5 0 69G 32G 70G 61K
time read miss miss% dmis dm% pmis pm% mmis mm% size c avail waste
13:04:56 5.8M 195 0 5 0 190 73 8 0 70G 32G 69G 59K
13:05:11 5.1M 1.0K 0 23 0 1.0K 70 48 0 68G 30G 71G 50K
13:05:26 42K 5.9K 13 161 0 5.7K 69 228 0 68G 29G 70G 33K
13:05:41 44K 6.0K 13 184 0 5.8K 66 142 0 69G 29G 70G 26K
```
But later on, while it is still running/blocking, the situation grows more and more dire:
```
time read miss miss% dmis dm% pmis pm% mmis mm% size c avail waste
13:16:27 40K 6.0K 14 302 0 5.6K 68 182 0 74G 10.0G 62G 44K
13:16:42 40K 6.1K 15 318 0 5.8K 70 215 0 75G 9.5G 60G 27K
13:16:57 39K 6.0K 15 317 1 5.7K 70 213 0 76G 9.0G 59G 23K
13:17:12 40K 6.0K 14 331 1 5.7K 69 206 0 77G 8.6G 57G 24K
13:17:27 42K 5.9K 13 334 0 5.6K 71 183 0 78G 8.3G 55G 21K
```
I am still investigating, but so far the only "cure" found was restarting the NFS server, which seems to let a lot of cache space "expire".
Thus, I would be happy for any hint as to what I could do to mitigate or get around this problem.
If you look at the arcstats, what is the size of the 'other' portion of the ARC?
It may be that you have a very large amount of dirty data: because your demand read rate is so high, the writes never get a chance to flush, and dirty data keeps building up. The write throttle should eventually slow the incoming rate of writes to resolve this, but it is something to look at.
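A minimal sketch (assuming the standard `/proc/spl/kstat/zfs/arcstats` layout, and treating 'other' as the bonus + dnode + dbuf buckets, which is roughly what `arc_summary` groups under that label) for pulling those numbers out:

```python
#!/usr/bin/env python3
# Sketch: break the current ARC size into a few buckets, including the
# "other" bucket (bonus + dnode + dbuf) and anon_size (dirty/in-flight data).
def read_arcstats(path="/proc/spl/kstat/zfs/arcstats"):
    stats = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            # Data lines look like: "<name> <type> <value>"
            if len(fields) == 3 and fields[1].isdigit():
                stats[fields[0]] = int(fields[2])
    return stats

s = read_arcstats()
gib = 2**30
other = s["bonus_size"] + s["dnode_size"] + s["dbuf_size"]
print(f"size      : {s['size'] / gib:6.1f} GiB (target c = {s['c'] / gib:.1f} GiB)")
print(f"anon_size : {s['anon_size'] / gib:6.1f} GiB")
print(f"other     : {other / gib:6.1f} GiB")
```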
With ZFS+NFS I have always had to limit ARC to far below the actual system memory, to account for ZFS taking up RAM from unknown slab allocations which it never releases.
Possibly related: #10302
> If you look at the arcstats, what is the size of the 'other' portion of the ARC?
Does 'other' refer to another "column" from the `arcstat` command or to entries from `/proc/spl/kstat/zfs/arcstats`?
I am currently creating test data on an otherwise unused NFS box to perform some tests there without interfering with other users' jobs. This will take a bit to set up, but hopefully I will be able to gather some data from `/proc/spl/kstat/zfs/arcstats`, `/proc/spl/kmem/slab` and `/proc/slabinfo`. Any other metrics I should try to gather?
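For the collection itself, a minimal sketch that periodically snapshots those three files into timestamped copies (the output directory and interval are arbitrary placeholders, not anything ZFS-specific):

```python
#!/usr/bin/env python3
# Sketch: snapshot the three files every INTERVAL seconds into timestamped
# copies under OUTDIR, so the state around an OOM can be reconstructed later.
import shutil, time
from pathlib import Path

SOURCES = [
    "/proc/spl/kstat/zfs/arcstats",
    "/proc/spl/kmem/slab",
    "/proc/slabinfo",        # usually needs root to read
]
OUTDIR = Path("/var/tmp/zfs-debug")   # assumed location, adjust as needed
INTERVAL = 60

OUTDIR.mkdir(parents=True, exist_ok=True)
while True:
    stamp = time.strftime("%Y%m%d-%H%M%S")
    for src in SOURCES:
        dst = OUTDIR / f"{stamp}{src.replace('/', '_')}"
        try:
            shutil.copyfile(src, dst)
        except OSError as e:
            print(f"could not copy {src}: {e}")
    time.sleep(INTERVAL)
```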
> Possibly related: #10302
Yeah, could be. As `waste` is always very small (ashift 12, default 128k recordsize, compression enabled but it does not look to be too effective, as the ratio currently is 1.00 for that user), I doubt we hit what PR #10701 addressed. Let's see if I can generate enough read-cache pressure to "reliably" generate a test case.
The stat I was looking for was `anon_size` from `/proc/spl/kstat/zfs/arcstats`, which will be the amount of dirty data. It will change as writes come in and are flushed out.
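A small sketch (same arcstats layout as above) of a watch loop that prints `anon_size` next to `size` and `c` every few seconds, to see the dirty data fluctuate between flushes:

```python
#!/usr/bin/env python3
# Sketch: poll anon_size from arcstats to watch how much dirty data is
# accumulating and being flushed over time.
import time

def arcstat(name, path="/proc/spl/kstat/zfs/arcstats"):
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) == 3 and fields[0] == name:
                return int(fields[2])
    return None

gib = 2**30
while True:
    print(f"anon_size={arcstat('anon_size') / gib:.2f}G "
          f"size={arcstat('size') / gib:.2f}G c={arcstat('c') / gib:.2f}G")
    time.sleep(5)
```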
So far, all my attempts to reproduce the problem have failed. I know the user had a directory with about 2 TByte of tarballs on his NFS share, and each tarball contained a few 1000 entries of either small uncompressed text files or much larger gzipped files, i.e. the individual files were compressed, but the outer `tar` containers were not!
Each client only accessed a single tarball, opened it, scanned it for all members and selected all of these based on the file names. Then those file names were sorted and each file was read in turn from the tarball, uncompressed to the local file system and then merged into a HDF5 file.
As the files within the tarball are not sorted by name, sorting the file names would obviously lead to random access of the tarball, but given that all of this happened within a Python script and Python's `tarfile` module seems to keep track of the positions where each tar member starts, that should not really be too harsh on the file system.
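Roughly, each client does something like the following (a sketch of my understanding of the workload; the tarball path and the final HDF5 merge are placeholders, not the user's actual code):

```python
#!/usr/bin/env python3
# Rough sketch of the per-client access pattern described above.
import tarfile
import tempfile

TARBALL = "/nfs/share/user/data-0001.tar"   # hypothetical path

with tarfile.open(TARBALL, "r") as tar:
    # Scanning for members walks the whole archive once and records offsets.
    members = [m for m in tar.getmembers() if m.isfile()]
    # Sorting by name decouples the read order from the on-disk member order,
    # so subsequent reads seek around the tarball instead of streaming it.
    members.sort(key=lambda m: m.name)
    with tempfile.TemporaryDirectory() as workdir:
        for m in members:
            tar.extract(m, path=workdir)
            # ... uncompress / merge the extracted file into an HDF5 file ...
```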
But so far, I have been unable to generate enough "pressure" or whatever is needed to make the ARC cache barrier break down on my partial copy of the user's data. The user's job was safe with 25 clients but caused problems with 250 clients running simultaneously. So far I have tried it with up to 800 clients, but the system just maxes out at about 60k reads and 15% cache misses:
```
13:17:55 69K 10K 15 532 0 9.9K 76 19 0 14G 14G 10G
13:18:00 65K 10K 15 618 1 9.4K 77 14 0 14G 14G 10G
13:18:05 66K 10K 15 568 1 9.7K 77 19 0 14G 14G 10G
time read miss miss% dmis dm% pmis pm% mmis mm% size c avail
13:18:10 66K 10K 15 548 1 9.5K 75 10 0 14G 14G 10G
13:18:15 64K 10.0K 15 613 1 9.4K 77 17 0 14G 14G 10G
13:18:20 64K 10K 15 596 1 9.6K 77 13 0 14G 14G 10G
13:18:25 62K 9.9K 15 643 1 9.3K 76 9 0 14G 14G 10G
```
(Please note that for this test I reduced the available RAM via the kernel command line, as I hoped to be able to trigger the problem more easily with less RAM.)
`dstat` shows the system being quite busy, but not exceedingly so:
```
----system---- --total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system-- ---load-avg--- ------memory-usage----- ----swap---
time |usr sys idl wai stl| read writ| recv send| in out | int csw | 1m 5m 15m | used free buff cach| used free
19-07 13:18:06| 4 20 0 76 0|1249M 381k|3916k 1961M| 0 0 | 67k 76k| 297 124 166|17.2G 11.6G 2132k 426M| 0 0
19-07 13:18:16| 4 19 0 76 0|1270M 363k|3911k 1958M| 0 0 | 67k 77k| 311 133 168|17.2G 11.6G 2132k 426M| 0 0
19-07 13:18:26| 4 19 0 77 0|1249M 371k|3871k 1915M| 0 0 | 66k 76k| 322 141 170|17.1G 11.8G 2132k 426M| 0 0
19-07 13:18:36| 4 19 0 77 0|1225M 375k|3795k 1858M| 0 0 | 65k 74k| 332 149 173|17.0G 11.8G 2132k 426M| 0 0
```
Any ideas on what kind of scenario I should/could try?
If you could print out `/proc/spl/kstat/zfs/arcstats` when it happens, that would be great for debugging.
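In case it helps with catching the moment, a small sketch that watches the ARC and dumps the full arcstats file once `size` overshoots the target `c` by more than 25% (the threshold and dump location are arbitrary placeholders):

```python
#!/usr/bin/env python3
# Sketch: watch the ARC and dump the full arcstats file once size has
# overshot the target (c) by more than 25%, before the OOM killer hits.
import time

ARCSTATS = "/proc/spl/kstat/zfs/arcstats"

def snapshot():
    with open(ARCSTATS) as f:
        return f.read()

while True:
    stats = {}
    for line in snapshot().splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[1].isdigit():
            stats[fields[0]] = int(fields[2])
    if stats["size"] > stats["c"] * 1.25:
        fname = time.strftime("/var/tmp/arcstats-%Y%m%d-%H%M%S")
        with open(fname, "w") as out:
            out.write(snapshot())
        print("ARC overshoot, dumped to", fname)
        time.sleep(300)   # avoid dumping on every iteration
    else:
        time.sleep(10)
```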
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.