netcdf4-python
Problem of RAM memory exhaustion for datasets with unlimited axis?
First, credit to @BrazhnikovDmitry for finding this; I am only writing up the issue, but he should get the credit for pointing it out :) .
It seems that reading from a dataset with an unlimited dimension can exhaust all available RAM and crash. For example, I have a file with an unlimited dimension of size:
time = UNLIMITED ; // (3235893 currently)
The data file is relatively big for use on my local machine (a laptop), but not huge: 1.6GB in total. My local machine has 16GB of RAM, of which more than 8GB is completely free.
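For reference, the unlimited dimension can be confirmed directly from Python; here is a minimal sketch, assuming the dimension is named time as in the CDL line above:

import netCDF4 as nc4

with nc4.Dataset("wave_data_ICEX2018.nc", "r") as nc4_fh:
    time_dim = nc4_fh.dimensions["time"]
    print(time_dim.isunlimited(), len(time_dim))  # expected: True 3235893
    print(nc4_fh["timeIMU"].shape)                # (number of instruments, 3235893)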
When trying to open a small slice of a field of the dataset (the first index is an "instrument ID", the second index is the unlimited time dimension):
[ins] In [1]: import netCDF4 as nc4
[ins] In [2]: file_path = "wave_data_ICEX2018.nc"
[ins] In [3]: with nc4.Dataset(file_path, "r", format="NETCDF4") as nc4_fh:
...: time_gps = nc4_fh["timeIMU"][0, 0:1000]
...:
all goes well.
But when trying to open the whole field:
[ins] In [5]: with nc4.Dataset(file_path, "r", format="NETCDF4") as nc4_fh:
...: data_lat = nc4_fh["timeIMU"][0, :]
Killed
all the RAM gets exhausted (I had over 8GB of RAM free when starting the command; RAM use seems to grow almost linearly over a few seconds until it is all used up) and the process gets killed automatically (which is actually fortunate: as you can imagine, my whole system freezes when all RAM is used, so it is nice that the process gets killed and system responsiveness is restored :) ).
The interesting thing is that when the exact same data is packaged with a fixed dimension size, the whole field can be opened with the same [0, :] without encountering any issue, using just a few hundred MB of RAM.
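For completeness, here is a minimal sketch of how the same data could be repacked into a file with a fixed-size time dimension (the output file name is illustrative, and in practice the copy may need to be done in chunks to limit memory use):

import netCDF4 as nc4

with nc4.Dataset("wave_data_ICEX2018.nc", "r") as src, \
     nc4.Dataset("wave_data_fixed.nc", "w", format="NETCDF4") as dst:
    var_in = src["timeIMU"]
    # re-create every dimension the variable uses, but with a fixed size
    for dim_name in var_in.dimensions:
        dst.createDimension(dim_name, len(src.dimensions[dim_name]))
    var_out = dst.createVariable("timeIMU", var_in.dtype, var_in.dimensions)
    # naive full copy; copying slice by slice avoids loading everything at once
    var_out[:] = var_in[:]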
- Any idea why this happens?
- Is this really a bug? (I can understand that an unlimited dimension may be less efficient than a statically sized one, but not that it is so inefficient that such a crash happens.)
Version and system information:
- OS: Ubuntu 20.04, fully updated
- ipython:
Python 3.8.10 (default, Mar 15 2022, 12:22:08)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.
[ins] In [1]: import netCDF4
[ins] In [2]: netCDF4.__version__
Out[2]: '1.5.3'
Please post the data file somewhere and put the link in this issue.
@jerabaul29 we really can't make any progress on diagnosing the problem without having access to the data file. Is there any problem with providing access to the dataset?
Hi @jswhit2! I was the one who initially encountered the memory over-usage problem. An example data set can be found here: https://www.dropbox.com/s/zk6js1cmt6p2tj9/wave_data_bad.nc?dl=0 If necessary, I can provide the code used to generate the nc-file.
Many thanks for uploading your example file @BrazhnikovDmitry :) . @jswhit2 sorry for the lack of response on my side, I was traveling and had some backlog. I can confirm that I get the error on the exact file @BrazhnikovDmitry uploaded :) .
OK, I've got the file now, thanks. Just curious why you decided to make the 'time' unlimited dimension the rightmost dimension (last in the list of dimensions for that variable). Typically the unlimited dimension is defined as the leftmost (slowest varying) dimension. I bet that if you had done it that way accessing the data along the unlimited dimension would be much faster.
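For illustration, a minimal sketch of defining the variable with the unlimited time dimension leftmost (the second dimension name and its size are made up for the example):

import numpy as np
import netCDF4 as nc4

with nc4.Dataset("wave_data_time_first.nc", "w", format="NETCDF4") as nc:
    nc.createDimension("time", None)       # unlimited, slowest varying (leftmost)
    nc.createDimension("instrument", 5)    # fixed size
    v = nc.createVariable("timeIMU", "f8", ("time", "instrument"))
    # writing past the current end of the unlimited axis grows it
    v[0:1000, :] = np.zeros((1000, 5))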
On MacOS with the latest github master for both netcdf4-python and netcdf-c I don't see this problem. Here's my simple test script:
from netCDF4 import Dataset
import tracemalloc, time
def read_data():
    nc = Dataset('wave_data_bad.nc')
    data = nc["timeIMU"][0, :]
    nc.close()
tracemalloc.start()
# time the call that reads the whole field
t1 = time.perf_counter()
read_data()
t2 = time.perf_counter()
print('time = %s secs' % str(t2-t1))
# display the peak traced memory
print('peak memory = %s bytes' % tracemalloc.get_traced_memory()[1])
# stop tracing memory allocations
tracemalloc.stop()
>> time = 110.724687782 secs
>> peak memory = 51784442 bytes
I'm pretty sure nothing has changed in the python module that would impact this, so perhaps it's something that could be remedied by updating the netcdf and hdf5 C libs?
OK, interesting. I saw it on Ubuntu 20.04, fully up to date, as previously mentioned. Just curious, @BrazhnikovDmitry, which OS and version are you using? :)
That was my thought as well. I have not had time to update to the latest netcdf library and check. The file was created with 4.7.4. According to https://github.com/Unidata/netcdf-c/pull/1913, some memory leaks were fixed in 4.8.0.
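For reference, the versions of the netcdf-c and HDF5 libraries that the installed netCDF4-python was built against can be checked from Python:

import netCDF4
print(netCDF4.__version__)              # python wrapper version
print(netCDF4.__netcdf4libversion__)    # netcdf-c library version
print(netCDF4.__hdf5libversion__)       # HDF5 library version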
I've also encountered this issue with unlimited dimensions, but I solved it similarly to https://github.com/Unidata/netcdf4-python/issues/859, by increasing the chunksize of the unlimited dimension.
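For illustration, a minimal sketch of setting an explicit chunksize along the unlimited dimension when creating the variable (the dimension layout and the chunk size here are only illustrative, not the settings used for the original file):

import netCDF4 as nc4

with nc4.Dataset("wave_data_chunked.nc", "w", format="NETCDF4") as nc:
    nc.createDimension("instrument", 5)
    nc.createDimension("time", None)    # unlimited
    # use a larger chunk along the unlimited time axis instead of the library default
    v = nc.createVariable("timeIMU", "f8", ("instrument", "time"),
                          chunksizes=(1, 100000))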