Reading in a large number of reader files: memory limit
I am working with SCHISM model files that each contain a single time step. At the moment I am reading in two months' worth of files using:
from opendrift.readers import reader_schism_native

data_path0 = '/<PATH>/schout_*.nc'
reader0 = reader_schism_native.Reader(data_path0, proj4='+proj=utm +zone=4 +ellps=WGS84 +datum=WGS84 +units=m +no_defs')
However, that kills the run by exceeding the memory limit. Each timestep/model file is 270 MB, so is creating the reader attempting to allocate roughly 388 GB of memory? Is there a better way to create the readers so that they access the timesteps one at a time?
The dataset is in this case opened with xarray's open_mfdataset: https://github.com/OpenDrift/opendrift/blob/master/opendrift/readers/reader_schism_native.py#L113 Maybe there is a memory leak there?
In the generic reader, some additional options are passed to open_mfdataset: https://github.com/OpenDrift/opendrift/blob/master/opendrift/readers/reader_netCDF_CF_generic.py#L100 Can you check whether any of these options solve the problem? I do not have any SCHISM files available for testing.
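For reference, a minimal sketch of how such lazy-loading options could be passed to open_mfdataset; the exact arguments used in the generic reader may differ, but chunking along time is the key point:

import xarray as xr

# Sketch (assumed options, not the exact reader code): open all files lazily,
# one time step per chunk, so data is only read from disk when a variable is sliced.
ds = xr.open_mfdataset('/<PATH>/schout_*.nc',
                       chunks={'time': 1},    # lazy dask arrays, one timestep per chunk
                       data_vars='minimal',   # only concatenate variables that vary along time
                       coords='minimal',
                       compat='override')     # skip costly equality checks on duplicated variables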
I've tried adding those arguments and am still getting the same issue. To confirm: is the intended behaviour to read the files in as needed, or does the simulation need to hold all the reader files in memory at once?
Update: reading in 2000 hourly timesteps with a single reader using 'schout_*.nc' kills the run due to the memory limit, but if I read the files into multiple readers in smaller chunks of between 100 and 1000 files (e.g. 'schout_??.nc', 'schout_???.nc', 'schout_1???.nc'), the memory limit is not reached and I'm able to successfully complete a simulation! It takes 20+ minutes to read the files in; does that seem reasonable for this amount of data?
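For illustration, a sketch of this multi-reader workaround; the paths, glob patterns, and file-numbering scheme are assumptions based on the description above:

from opendrift.models.oceandrift import OceanDrift
from opendrift.readers import reader_schism_native

proj4 = '+proj=utm +zone=4 +ellps=WGS84 +datum=WGS84 +units=m +no_defs'

# Split the ~2000 hourly files across several readers via glob patterns,
# so no single reader has to index the whole dataset at once.
patterns = ['/<PATH>/schout_??.nc',     # files 10-99
            '/<PATH>/schout_???.nc',    # files 100-999
            '/<PATH>/schout_1???.nc']   # files 1000-1999

readers = [reader_schism_native.Reader(p, proj4=proj4) for p in patterns]

o = OceanDrift(loglevel=20)
o.add_reader(readers)  # readers are tried in order; those not covering the requested time are skipped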
See this related discussion: https://github.com/OpenDrift/opendrift/discussions/1241#discussioncomment-8869454
So you could also try installing h5netcdf (conda install h5netcdf) and adding engine="h5netcdf" to open_mfdataset in the SCHISM reader.
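A minimal sketch of that change, assuming an open_mfdataset call like the one in the SCHISM reader (the surrounding arguments in the actual source may differ):

import xarray as xr

# Sketch: use the h5netcdf backend instead of the default netCDF4 engine,
# as suggested in the linked discussion.
ds = xr.open_mfdataset('/<PATH>/schout_*.nc',
                       chunks={'time': 1},
                       engine='h5netcdf')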