Hangs while saving netcdf file opened using xr.open_mfdataset with lock=None
I am testing out code that uses xarray to process netcdf files, in particular to join multiple netcdf files into one along shared dimensions. This was working well, except that sometimes the process would hang when saving the netcdf file.
I was able to whittle it down to this simple example: https://github.com/jessicaaustin/xarray_netcdf_hanging_issue
This is the code snippet at the core of the example:
# If you set lock=False then this runs fine every time.
# Setting lock=None causes it to intermittently hang on mfd.to_netcdf
import os
import uuid

import xarray as xr

with xr.open_mfdataset(['dataset.nc'], combine='by_coords', lock=None) as mfd:
    p = os.path.join('tmp', 'xarray_{}.nc'.format(uuid.uuid4().hex))
    print(f"Writing data to {p}")
    mfd.to_netcdf(p)
    print("complete")
If you run this once, it's typically fine. But run it over and over again in a loop, and it will eventually hang on mfd.to_netcdf. However, if I set lock=False, it runs fine every time.
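For context, the loop that reproduces it looks roughly like this (a minimal sketch; the iteration count is arbitrary and dataset.nc is the same sample file as in the repo):

```python
import os
import uuid

import xarray as xr

os.makedirs('tmp', exist_ok=True)

for i in range(100):  # arbitrary count; the hang shows up intermittently
    with xr.open_mfdataset(['dataset.nc'], combine='by_coords', lock=None) as mfd:
        p = os.path.join('tmp', 'xarray_{}.nc'.format(uuid.uuid4().hex))
        print(f"[{i}] writing data to {p}")
        mfd.to_netcdf(p)  # intermittently hangs here with lock=None
```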
I've seen this with the following combos:
- xarray=0.14.1
- dask=2.9.1
- netcdf4=1.5.3
and
- xarray=0.15.1
- dask=2.14.0
- netcdf4=1.5.3
And I've tried it with different netcdf files and different computers.
Versions
Output of `xr.show_versions()`
INSTALLED VERSIONS
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Mar 23 2020, 23:03:20) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.15.0-20-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.4

xarray: 0.15.1
pandas: 1.0.3
numpy: 1.18.1
scipy: None
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.1.1.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.14.0
distributed: 2.14.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 46.1.3.post20200325
pip: 20.0.2
conda: None
pytest: None
IPython: None
sphinx: None
Thanks @jessicaaustin. We have run into the same issue. Setting lock=False works, but as hdf5 is not thread safe, we are not sure if this could have unexpected consequences.
Edit: Actually, I have checked, and the hdf5 version we are using (from conda-forge) is built in thread-safe mode. This means that concurrent reads are possible and that lock=False in open_mfdataset would be safe. In fact, it is more efficient, since it does not make sense to take locks if hdf5 is already thread-safe. Am I right?
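For what it's worth, the docstring for these versions says lock accepts "False or lock-like", so a third option is to pass one explicit lock and share it across reads; a minimal sketch, assuming dask's SerializableLock qualifies as lock-like here:

```python
import xarray as xr
from dask.utils import SerializableLock  # assumption: accepted by the lock= keyword

hdf5_lock = SerializableLock()

# All reads go through the same explicit lock instead of relying on the
# default (lock=None) choice that triggers the hang.
with xr.open_mfdataset(['dataset.nc'], combine='by_coords', lock=hdf5_lock) as mfd:
    mfd.to_netcdf('out.nc')
```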
Using:
- xarray=0.15.1
- dask=2.14.0
- netcdf4=1.5.3
I have experienced this issue as well when writing netcdf files using xr.save_mfdataset on a dataset opened with xr.open_mfdataset. As described by the OP, it hangs when using lock=None (the default) on xr.open_mfdataset(), but works fine when using lock=False.
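The pattern that hangs for me looks roughly like this (a minimal sketch; the file names and the split by year are made up for illustration):

```python
import xarray as xr

# lock=None (the default) hangs intermittently; lock=False works
with xr.open_mfdataset(['a.nc', 'b.nc'], combine='by_coords', lock=None) as ds:
    years, datasets = zip(*ds.groupby('time.year'))
    paths = [f'out_{year}.nc' for year in years]
    xr.save_mfdataset(datasets, paths)  # hangs here with lock=None
```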
Using:
- xarray=0.16.0
- dask=2.25.0
- netcdf4=1.5.4
I am experiencing the same when trying to write a netcdf file using .to_netcdf() on files opened via xr.open_mfdataset with lock=None.
Then I tried OP's suggestion and it worked like a charm
BUT
Now I am facing a different issue. It seems that hdf5 IS NOT thread-safe after all: I get a NetCDF: HDF error when applying a different function to a netcdf file that was previously processed by another function with lock=False. The script just terminates, not even reaching any calculation step in the code. It seems like lock=False works the opposite way and leaves the file in a corrupted state?
This is the BIGGEST issue and needs to be resolved ASAP.
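In case it helps narrow this down, one thing worth ruling out is a stale open handle between the two steps; a minimal sketch of what I mean, with hypothetical step functions, where each step forces its computation and fully closes the source file before the next step reopens it:

```python
import xarray as xr

def step_one(in_path, out_path):
    # hypothetical first processing step
    with xr.open_mfdataset([in_path], combine='by_coords', lock=False) as ds:
        result = ds.mean('time').load()  # compute while the source is still open
    # the source handle is closed here, before any other file is touched
    result.to_netcdf(out_path)

def step_two(path):
    # hypothetical second step that reopens the file written by step_one
    with xr.open_mfdataset([path], combine='by_coords', lock=False) as ds:
        return ds.load()
```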
I have the same issue as well, and it appears to me that Ubuntu systems are more prone to it than CentOS. Wondering if anyone else has had a similar experience.
I have the same behaviour with MacOS (10.15). xarray=0.16.1, dask=2.30.0, netcdf4=1.5.4. Sometimes saves, sometimes doesn't. lock=False seems to work.
> I have the same behaviour with MacOS (10.15). xarray=0.16.1, dask=2.30.0, netcdf4=1.5.4. Sometimes saves, sometimes doesn't. lock=False seems to work.
lock=False sometimes throws an HDF5 error. No clear solution.
The only solution I have found is to add a 1-second sleep.
> lock=False sometimes throws an HDF5 error. No clear solution.
I haven't seen that yet, but I'd still far prefer an occasional error to a hung process.
Just adding my +1 here, and also mentioning that (if memory allows) ds.load() also helps (related: https://github.com/pydata/xarray/issues/4710).
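Concretely, the workaround I mean is to pull the data into memory before writing; a minimal sketch (file names are placeholders):

```python
import xarray as xr

with xr.open_mfdataset(['a.nc', 'b.nc'], combine='by_coords') as mfd:
    ds = mfd.load()  # read everything into memory while the source files are open

ds.to_netcdf('combined.nc')  # the write no longer has to read lazily from the sources
```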
Also seeing this as of version 0.16.1.
In some cases I need lock=False, otherwise I run into hung processes a certain percentage of the time. ds.load() prior to to_netcdf() does not solve the problem.
In other cases I need lock=None, otherwise I consistently get RuntimeError: NetCDF: Not a valid ID.
Is the current recommended solution to set lock=False and retry until success? Or, is it to keep lock=None and use zarr instead? @dcherian
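To spell out the first option, the retry approach I have in mind is something like this (a minimal sketch; the exception types caught and the retry count are guesses on my part):

```python
import time

import xarray as xr

def write_with_retries(paths, out_path, retries=3):
    # hypothetical helper: open with lock=False and retry if the write errors out
    for attempt in range(retries):
        try:
            with xr.open_mfdataset(paths, combine='by_coords', lock=False) as mfd:
                mfd.to_netcdf(out_path)
            return
        except (OSError, RuntimeError):  # the NetCDF/HDF errors reported above surface as RuntimeError
            if attempt == retries - 1:
                raise
            time.sleep(1)
```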
> Is the current recommended solution to set lock=False and retry until success? Or, is it to keep lock=None and use zarr instead? @dcherian
Or alternatively, you can try adding a sleep between openings.
When you open the same file from different functions with different operations, it is better to wrap the file-opening call in a 1-second delay/sleep rather than opening it directly; see the sketch below.
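Something like this (a minimal sketch; the wrapper name is made up, and the 1-second delay is just the value that worked for me):

```python
import time

import xarray as xr

def open_with_delay(paths, **kwargs):
    # hypothetical wrapper: pause before every open so back-to-back opens of
    # the same file from different functions don't overlap
    time.sleep(1)
    return xr.open_mfdataset(paths, combine='by_coords', **kwargs)

with open_with_delay(['dataset.nc'], lock=False) as ds:
    ds.to_netcdf('copy.nc')
```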
> Or alternatively, you can try adding a sleep between openings.
To clarify, do you mean adding a sleep of e.g. 1 second prior to your preprocess function (and setting preprocess to just sleep then return ds if you're not doing any preprocessing)? Or, are you instead sleeping before the entire open_mfdataset call?
Is this solution only addressing the issue of opening the same ds multiple times within a python process, or would it also address multiple processes opening the same ds?
Please run some dummy tests. I added a time.sleep prior to every operation; this was the only workaround that really worked.
Any progress in solving this problem? I am using
- xarray 0.20.1
- netcdf4 1.6.2
None of the above suggestions (lock=False, time.sleep(1)) works for me.