
Huge memory consumption in chunk-cache after handling several opened netcdf-files with default-chunk-cache configuration


When using the default chunk-cache settings and opening several files simultaneously, the netcdf library creates a permanent memory consumption of up to 64MB per file and variable, even after all files are closed (approx. 1.6GB of chunk-cache for 5 files with 5 variables each).

The problem can be circumvented by using any chunk-cache size different from the default. It might be related to https://docs.unidata.ucar.edu/netcdf-c/current/nc4hdf_8c_source.html line 1155 / nc4_adjust_var_cache.
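
A minimal sketch of this workaround in C (my own illustration, not part of the original report; the file name is a placeholder): setting any non-default cache size via nc_set_chunk_cache before opening files avoids the problematic growth.

#include <netcdf.h>
#include <stdio.h>

int main(void) {
    /* any size other than the built-in default (16777216 on my builds)
     * avoids the auto-adjustment path; one byte more is already enough */
    if (nc_set_chunk_cache((size_t)16777216 + 1, 4133, 0.75f) != NC_NOERR)
        fprintf(stderr, "nc_set_chunk_cache failed\n");

    int ncid;
    /* "testfile0.nc" is a placeholder name */
    if (nc_open("testfile0.nc", NC_NOWRITE, &ncid) == NC_NOERR) {
        /* ... read data; the per-variable caches now stay bounded ... */
        nc_close(ncid);
    }
    return 0;
}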

Environment: netcdf-4.9.2 and netcdf-4.8.1, tested on Linux (e.g. Ubuntu 22.04 with the default netcdf-4.8.1, or RHEL 8 with the latest netcdf 4.9.2 from conda). The problem occurs both with python-netCDF4 and in an in-house C++ application: https://github.com/metno/fimex

A test program (in Python) is attached: test_netcdf_memusage.zip. It first creates 5 larger files (5 variables of size 64MB = 320MB each) in the local directory and then reads them, once with a modified and once with the default chunk-cache size. The basic reading function:

import netCDF4

num_vars = 5  # number of data variables per file (5 in the attached test)

def netcdf_test(paths: list):
    nc_list = []
    for f in paths:
        nc = netCDF4.Dataset(f, 'r')
        nc_list.append(nc)
        # touch one value per time step in every variable to fill the chunk cache
        for t in range(nc["var0"].shape[0]):
            for i in range(num_vars):
                v = nc[f"var{i}"][t, 0, 0, 0]
        v = None
    # the close is done outside the for-loop to simulate simultaneously opened
    # files, as e.g. xarray.open_mfdataset does
    for nc in nc_list:
        nc.close()

output:

$ python3 test_netcdf_memusage.py 
after creation of files: 175MB, files: 8
modified chunk-cache:  (16777217, 4133, 0.75)
memory-leak per file*variable with modified chunk-cache, netcdf4: 0MB
total: 175MB, files: 8
default chunk-cache:  (16777216, 4133, 0.75)
memory-leak per file*variable with default chunk-cache, netcdf: 59MB
total: 1650MB, files: 8

So most of the data is still cached in the chunk-cache, even after all files are closed and no data is held by Python/numpy.

Best regards, Heiko

heikoklein avatar Apr 30 '24 15:04 heikoklein

Thanks! I'll take a look at this.

WardF avatar Apr 30 '24 16:04 WardF

Looking at this, and the fix that was applied in xarray, I believe I see how we might be able to fix this. Thanks for the report, and your patience!

WardF avatar Jun 11 '24 21:06 WardF

After further investigation, I'm at a loss as to how to address this; perhaps opening a discussion over at the netCDF-Python repository would help? I've tried to duplicate this issue in pure C and have not been able to, nor have I been able to uncover any latent memory issues through static and dynamic testing. I'll keep trying, but in the meantime, my limited experience with and knowledge of Python means there isn't a lot I can do immediately.

I've attached the test files: a modified version of the provided Python test script, and the C version I was using for testing.

WardF avatar Jun 12 '24 21:06 WardF

Thanks for looking into that. I've just compiled your C program against 4.8.1 and am running it with the following output:

$ ./test_netcdf 
after creation of files: 143MB
memory-leak per file*variable with modified chunk-cache, netcdf: 0MB
total: 143MB
memory-leak per file*variable with default chunk-cache, netcdf4: 8MB
total: 339MB

The same with the python version is:

$ python3 test_netcdf_memusage.py 
after creation of files: 171MB, files: 4
modified chunk-cache:  (16777217, 4133, 0.75)
memory-leak per file*variable with modified chunk-cache, netcdf4: 0MB
total: 171MB, files: 4
default chunk-cache:  (16777216, 4133, 0.75)
memory-leak per file*variable with default chunk-cache, netcdf: 59MB
total: 1647MB, files: 4

I noticed a small difference between the Python and the C version: the hard-coded values of the cache. You set '16777216, 1000, 0.75', while the default cache of my version, according to nc_get_chunk_cache, is '16777216, 4133, 0.75'. I tried to adapt these values, but the result is still that I don't see the memory leak in the C version that is visible in the Python version.
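
For reference, a minimal sketch of querying these defaults through the C API (my own illustration, not code from the thread):

#include <netcdf.h>
#include <stdio.h>

int main(void) {
    size_t size, nelems;
    float preemption;
    /* reports the library-wide default chunk-cache parameters */
    if (nc_get_chunk_cache(&size, &nelems, &preemption) == NC_NOERR)
        printf("default chunk_cache: %zu %zu %f\n", size, nelems, preemption);
    return 0;
}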

heikoklein avatar Jun 13 '24 14:06 heikoklein

I appreciate you double-checking in your environment; are you able to open an issue over at netcdf4-python? I'd be happy to open one and link this to it, but you may be able to provide more relevant information given your understanding of Python and the underlying test case.

WardF avatar Jun 13 '24 17:06 WardF

I notice your test with 4.8.1 still has an 8MB memory leak per file with the default chunk sizes; I was able to replicate this amongst various versions. It looks like the fix was introduced after the v4.9.2 release. I am currently working on the first release candidate for v4.9.3, so hopefully that should be able to get this solved!

WardF avatar Jun 13 '24 17:06 WardF

While looking again at your program, I found an issue with your code: it reads var 0-5 by varid, while the data variables' varids are 4 to 8 (in my case). An updated version can be found in test_programs.zip

I changed

for (int t = 0; t < NUM_VARS; ++t) {
    ...
    nc_get_var1_float(ncids[i], t, ....)

to

for (int t = 0; t < NUM_VARS; ++t) {
    ...
    sprintf(var_name, "var%d", t);
    nc_inq_varid(ncids[i], var_name, &varid);
    nc_get_vara(ncids[i], varid, index, count, &value);

and I see the same memory consumption as with the python version:

$ ./test_netcdf 
default chunk_cache: 16777216 4133 0.750000
after creation of files: 143MB
total before closing files: 143MB
memory-leak per file*variable with modified chunk-cache, netcdf: 0MB
total: 143MB
total before closing files: 1618MB
memory-leak per file*variable with default chunk-cache, netcdf4: 59MB
total: 1618MB
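
For completeness, a self-contained sketch of the corrected read loop (a reconstruction; NUM_FILES, the file names, and the 4-D float variables var0..var4 are assumptions about the test program's layout):

#include <netcdf.h>
#include <stdio.h>

#define NUM_FILES 4  /* assumed, matching the 4-file runs above */
#define NUM_VARS  5

int main(void) {
    int ncids[NUM_FILES];
    char fname[64], var_name[32];
    size_t index[4] = {0, 0, 0, 0};  /* read a single value at the origin */
    size_t count[4] = {1, 1, 1, 1};
    float value;

    for (int i = 0; i < NUM_FILES; ++i) {
        snprintf(fname, sizeof fname, "testfile%d.nc", i);  /* placeholder names */
        if (nc_open(fname, NC_NOWRITE, &ncids[i]) != NC_NOERR) {
            ncids[i] = -1;
            continue;
        }
        for (int t = 0; t < NUM_VARS; ++t) {
            int varid;
            /* look the variable up by name instead of assuming varid == t,
             * so the coordinate variables (varids 0-3 here) are not read */
            snprintf(var_name, sizeof var_name, "var%d", t);
            if (nc_inq_varid(ncids[i], var_name, &varid) == NC_NOERR)
                nc_get_vara(ncids[i], varid, index, count, &value);
        }
    }
    /* close only after all files have been read, to keep them open
     * simultaneously as in the Python test */
    for (int i = 0; i < NUM_FILES; ++i)
        if (ncids[i] != -1)
            nc_close(ncids[i]);
    return 0;
}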

I haven't checked v4.9.3 yet, but as far as I've seen, the default chunk-sizes have been changed, and the test files' chunks are too big to fit into the v4.9.3 cache? I hope you can re-open the issue. I will close the netCDF4-python issue.

heikoklein avatar Jun 14 '24 15:06 heikoklein

I have now compiled the latest netcdf version from git and I don't see the huge memory consumption any longer, even after adapting to the new chunk sizes. This solves the issue.

Looking at the new code for nc4_adjust_var_cache and considering that CHUNK_CACHE_SIZE == DEFAULT_CHUNK_CACHE_SIZE (at least according to my config.h):

    if (var->chunkcache.size == CHUNK_CACHE_SIZE)
        if (chunk_size_bytes > var->chunkcache.size)
        {
            var->chunkcache.size = chunk_size_bytes * DEFAULT_CHUNKS_IN_CACHE;
            if (var->chunkcache.size > DEFAULT_CHUNK_CACHE_SIZE)
                var->chunkcache.size = DEFAULT_CHUNK_CACHE_SIZE;
            if ((retval = nc4_reopen_dataset(grp, var)))
                return retval;
        }

This code block will never change var->chunkcache.size: whenever the branch is entered, the newly computed size is greater than DEFAULT_CHUNK_CACHE_SIZE (== CHUNK_CACHE_SIZE) and is therefore immediately capped back to the value it started with. So any automatic adjustment of the chunk-cache has been disabled as of 4.9.3, and nc4_adjust_var_cache is dead code, as far as I understand?
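
Plugging in concrete numbers makes the no-op visible; a standalone sketch, assuming CHUNK_CACHE_SIZE == DEFAULT_CHUNK_CACHE_SIZE == 16777216 (from my config.h), an assumed DEFAULT_CHUNKS_IN_CACHE of 10, and a 64MB chunk as in the test files:

#include <stdio.h>

#define CHUNK_CACHE_SIZE         16777216UL  /* from my config.h */
#define DEFAULT_CHUNK_CACHE_SIZE 16777216UL  /* equal to CHUNK_CACHE_SIZE here */
#define DEFAULT_CHUNKS_IN_CACHE  10UL        /* assumed value */

int main(void) {
    unsigned long cache_size = CHUNK_CACHE_SIZE;  /* var->chunkcache.size */
    unsigned long chunk_size_bytes = 67108864UL;  /* one 64MB chunk */

    /* same structure as the block quoted above */
    if (cache_size == CHUNK_CACHE_SIZE && chunk_size_bytes > cache_size) {
        cache_size = chunk_size_bytes * DEFAULT_CHUNKS_IN_CACHE;  /* 640MB */
        if (cache_size > DEFAULT_CHUNK_CACHE_SIZE)
            cache_size = DEFAULT_CHUNK_CACHE_SIZE;  /* capped back to 16MB */
    }
    /* cache_size ends up exactly where it started */
    printf("cache size: %lu (unchanged)\n", cache_size);
    return 0;
}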

heikoklein avatar Jun 16 '24 13:06 heikoklein

That is correct, and thanks!

WardF avatar Jun 17 '24 16:06 WardF