
open_mfdataset parallel=True failing on first attempt

Open cefect opened this issue 1 year ago • 8 comments

What happened?

With parallel=True, open_mfdataset fails on the first call with NetCDF: Unknown file format. Running the same command a second time (inside a try/except), or running it with parallel=False, executes as expected.

works:

xr.open_mfdataset(dirpath +'\\*.nc', parallel=False)

works:

try:
   xr.open_mfdataset(dirpath +'\\*.nc', parallel=True)
except:
   xr.open_mfdataset(dirpath +'\\*.nc', parallel=True)

fails:

xr.open_mfdataset(dirpath +'\\*.nc', parallel=True)

[Errno -51] NetCDF: Unknown file format

All with engine='netcdf4'. Any help is highly appreciated, as I'm a bit lost on how to investigate this further.
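
The try/except workaround above can be wrapped in a small helper. This is a minimal sketch, not a fix: `call_with_one_retry` is a hypothetical name, and catching `OSError` is an assumption based on the `[Errno -51]` prefix, since no traceback is shown here. It only hides the first failure, and a later comment in this thread reports that retrying did not help in another environment.

```python
import time


def call_with_one_retry(func, *args, retry_on=(OSError,), delay=0.0, **kwargs):
    """Call func(*args, **kwargs); on a listed exception, wait and try once more."""
    try:
        return func(*args, **kwargs)
    except retry_on:
        # The first failure is swallowed; a second failure propagates to the caller.
        time.sleep(delay)
        return func(*args, **kwargs)


# Hypothetical usage, mirroring the report (requires xarray and netCDF4):
# ds = call_with_one_retry(xr.open_mfdataset, dirpath + '\\*.nc', parallel=True)
```

Unlike the bare `except:` in the report, this retries only on the listed exception types, so unrelated errors such as a mistyped keyword argument still fail immediately.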

What did you expect to happen?

No response

Minimal Complete Verifiable Example

No response

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

cefect avatar Sep 25 '22 13:09 cefect

I ran into this problem yesterday reading netcdf files on our HPC with a known good script and netcdf files. Unfortunately just trying to open the files again in a try..except block did not work for me. Looking back through my environment update history I found that the netcdf4 library had been updated since I'd last successfully run the script. The current version installed was conda-forge/linux-64::netcdf4-1.6.1-nompi_py39hfaa66c4_100; I rolled it back to conda-forge/linux-64::netcdf4-1.6.0-nompi_py39h6ced12a_102. After the rollback the script started working again without error.

pnorton-usgs avatar Sep 29 '22 11:09 pnorton-usgs

I believe you are hitting https://github.com/Unidata/netcdf4-python/issues/1192

There is no verdict on that one yet. Your parallelization may not be thread safe, which would make the 1.6.1 failures expected. For now, if you can, downgrade to 1.6.0 or use an engine that is thread safe. Maybe h5netcdf (not sure!)?

ocefpaf avatar Oct 04 '22 15:10 ocefpaf

Also, you can try:

import dask
dask.config.set(scheduler="single-threaded")

That would ensure you don't use threads when reading with netcdf-c (netcdf4).


Edit: this is not an xarray problem, and I recommend closing this issue and following up on the one already opened upstream.

ocefpaf avatar Oct 04 '22 19:10 ocefpaf
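
The setting above applies globally. As a sketch of a narrower variant, `dask.config.set` can also be used as a context manager, so the single-threaded scheduler is active only while the files are being opened. The helper name `open_mfdataset_single_threaded` is hypothetical, and whether later computations on the returned dataset also need the single-threaded scheduler is not established in this thread.

```python
def open_mfdataset_single_threaded(pattern, **kwargs):
    """Open files matching `pattern` while dask's single-threaded scheduler is active."""
    # Imported inside the function so that defining this helper does not
    # itself require dask, xarray, or netCDF4 to be installed.
    import dask
    import xarray as xr

    # The scheduler setting is restored when the with-block exits.
    with dask.config.set(scheduler="single-threaded"):
        return xr.open_mfdataset(pattern, parallel=True, engine="netcdf4", **kwargs)


# Hypothetical usage:
# ds = open_mfdataset_single_threaded("data/*.nc")
```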

@ocefpaf and all: thank you! What a mysterious error this has been. Using the workaround

import dask
dask.config.set(scheduler="single-threaded")

did indeed avoid the issue for me.

kthyng avatar Oct 12 '22 19:10 kthyng

Note that this is not a bug per se: netcdf-c was never thread safe, and this issue surfaced when the workarounds were removed from netcdf4-python. The right fix is to disable threads, as in my example above, or to wait for a netcdf-c release that is thread safe. I don't think the workarounds will be re-added to netcdf4-python.

ocefpaf avatar Oct 12 '22 19:10 ocefpaf

The right fix is to disable threads, like in my example above

This fix will restrict you to serial compute.

You can also parallelize across processes using something like

PBSCluster(
	...,
	cores=1,
	processes=2,
)

or LocalCluster(threads_per_worker=1, ...)

dcherian avatar Oct 12 '22 20:10 dcherian
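
The LocalCluster suggestion above could look roughly like the following sketch, assuming dask.distributed is installed. The function name, the default of two workers, and the file pattern are all hypothetical. Each worker is a separate process running a single thread, so parallelism comes from processes rather than from threads sharing one netcdf-c instance.

```python
def summarize_with_process_workers(pattern, n_workers=2):
    """Open files on a local cluster of single-threaded worker processes."""
    # Imported inside the function so that defining this helper does not
    # itself require xarray or dask.distributed to be installed.
    import xarray as xr
    from dask.distributed import Client, LocalCluster

    # processes=True with threads_per_worker=1: one thread per worker process.
    with LocalCluster(
        n_workers=n_workers, threads_per_worker=1, processes=True
    ) as cluster, Client(cluster):
        # While the Client is active, dask work is sent to the cluster.
        ds = xr.open_mfdataset(pattern, parallel=True, engine="netcdf4")
        # Return only a text summary; the lazy dataset should not be
        # computed after the cluster has shut down.
        return repr(ds)


# Hypothetical usage; when run as a script, call this from under
# `if __name__ == "__main__":` so that worker processes can start cleanly.
# print(summarize_with_process_workers("data/*.nc"))
```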

This fix will restrict you to serial compute.

I was waiting for someone who works on clusters to comment on that. Thanks! (My workflow is my own laptop only, so I'm quite limited on that front :smile:)

ocefpaf avatar Oct 12 '22 20:10 ocefpaf

My workflow is my own laptop only

Use LocalCluster! ;)

dcherian avatar Oct 12 '22 20:10 dcherian