`open_mfdataset` hangs indefinitely with h5netcdf and dask since commit ea9f02 -> v2025.9.1
What happened?
Description
Since https://github.com/pydata/xarray/commit/ea9f02bbe6d3b02fbb56600710b2792795e0e4a5 and #10571, the following snippet hangs indefinitely — the files are opened successfully, but xr.open_mfdataset seems to hang and never completes. I end up having to kill the process. Reverting to the commit just before this change works fine.
```python
# `engine` ("h5netcdf" here), `partial_preprocess`, `self.time_coder`, and
# `drop_vars_list` are defined elsewhere in my code.
open_mfdataset_params = {
    "engine": engine,
    "parallel": True,
    "preprocess": partial_preprocess,
    "data_vars": "all",
    "concat_characters": True,
    "mask_and_scale": True,
    "decode_cf": True,
    "decode_times": self.time_coder,
    "decode_coords": True,
    "compat": "override",
    "coords": "minimal",
    "drop_variables": drop_vars_list,
}
ds = xr.open_mfdataset(batch_files, **open_mfdataset_params)
```
Context
- Engine: h5netcdf
- Scheduler: Dask running on a Coiled cluster
- Input: `batch_files` is a list of open NetCDF file objects stored in S3 (`fileset = [s3_fs.open(file) for file in s3_paths]`)
- `preprocess` runs successfully on all files, but `open_mfdataset` itself gets stuck
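To make the `fileset` pattern above concrete, here is a minimal sketch of its shape. The real code uses `s3fs` open-file objects; in this illustration, local temporary files stand in for the S3 objects, and the placeholder bytes are not real NetCDF content:

```python
import os
import tempfile

# Stand-in for S3: a few local files playing the role of remote NetCDF objects.
tmpdir = tempfile.mkdtemp()
paths = [os.path.join(tmpdir, f"file_{i}.nc") for i in range(3)]
for p in paths:
    with open(p, "wb") as f:
        f.write(b"placeholder")  # illustrative bytes only, not a valid NetCDF file

# The real code does: fileset = [s3_fs.open(file) for file in s3_paths]
# i.e. a list of open file-like objects, which is then passed to open_mfdataset.
fileset = [open(p, "rb") for p in paths]
print(len(fileset))  # 3 open handles
for f in fileset:
    f.close()
```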
What I’ve checked
- I’m aware of https://github.com/pydata/xarray/issues/10712, and like @rabernat, that change broke my code. I actually caught it in my own integration tests, whereas the xarray unit tests didn’t expose it at the time. More generally, I think some of these regressions could be avoided if there were higher-level integration tests in addition to the existing unit tests. That particular issue was fixed about two weeks ago and resolved the pickle error I was seeing. However, I’m still encountering the hang I describe here.
- Rolling back to the commit before https://github.com/pydata/xarray/commit/ea9f02bbe6d3b02fbb56600710b2792795e0e4a5 avoids the problem. In my case, 938e18680cbd9649f375d1544af8e5a16c352210 is the latest working commit.
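For anyone wanting to reproduce the comparison against that last working commit, one way (assuming a pip-based environment, since pip supports direct git references) is a `requirements.txt` pin like:

```
xarray @ git+https://github.com/pydata/xarray.git@938e18680cbd9649f375d1544af8e5a16c352210
```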
What did you expect to happen?
xr.open_mfdataset should complete successfully, as it does on earlier commits.
Minimal Complete Verifiable Example
```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "xarray[complete]@git+https://github.com/pydata/xarray.git@main",
# ]
# ///
#
# This script automatically imports the development branch of xarray to check for issues.
# Please delete this header if you have _not_ tested this script with `uv run`!
import xarray as xr

xr.show_versions()
# your reproducer code ...
```
Steps to reproduce
No response
MVCE confirmation
- [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [ ] Complete example — the example is self-contained, including all data and the text of any traceback.
- [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- [x] New issue — a search of GitHub Issues suggests this is not a duplicate.
- [x] Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
Anything else we need to know?
No response
Environment
```
INSTALLED VERSIONS
------------------
commit: None
python: 3.12.11 | packaged by conda-forge | (main, Jun 4 2025, 14:45:31) [GCC 13.3.0]
python-bits: 64
OS: Linux
OS-release: 6.8.0-39-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: en_IE.UTF-8
LOCALE: ('C', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development

xarray: 2025.7.2.dev31+gea9f02bbe
pandas: 2.3.3
numpy: 1.26.4
scipy: 1.16.2
netCDF4: 1.6.5
pydap: 3.5.7
h5netcdf: 1.6.4
h5py: 3.11.0
zarr: 2.18.7
cftime: 1.6.4.post1
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.6.0
dask: 2025.9.1
distributed: 2025.9.1
matplotlib: 3.10.6
cartopy: 0.25.0
seaborn: 0.13.2
numbagg: 0.9.3
fsspec: 2025.5.1
cupy: None
pint: None
sparse: 0.17.0
flox: 0.10.7
numpy_groupies: 0.11.3
setuptools: 80.9.0
pip: 25.2
conda: None
pytest: 8.4.2
mypy: None
IPython: 7.34.0
sphinx: 8.2.3
```
Further to my first post, I realised the process doesn't actually hang; it just spends an excruciating amount of time doing "nothing" compared to the last working commit.
I tried the somewhat obscure option `xr.set_options(use_new_combine_kwarg_defaults=True)` found in https://github.com/pydata/xarray/issues/1385#issuecomment-3144866423, but this hasn't helped.
For comparison, here is the Coiled dashboard with the latest 2025.9.1 release. The empty gap between the green and blue bars is where nothing happens while opening 50 NetCDF files; in my case it takes 7 minutes.
The dask status page is now painfully slow.
With the latest working xarray commit 938e186, I don't have this problem (see screenshot below, on exactly the same sample of data as above), so something clearly changed recently, but it didn't show up in my unit tests because it's still technically working, just very, very slowly.
I ended up having to create a fork of xarray pinned to a June 2025 version, and manually applied some xarray bug fixes I needed in order to have xarray working as expected for my case (https://github.com/lbesnard/xarray).
Does this issue need more information?
What would help with this issue is a minimal complete verifiable example (MCVE).
Without one, it will be impossible for Xarray developers to identify the root cause and fix this issue.
@lbesnard I remember another change shortly after that which directly affects `open_mfdataset`: #9955.
https://github.com/pydata/xarray/blob/101a5c2116a7ca9b658b5def4cab9b5e2688d590/xarray/backends/api.py#L1630-L1650
Would you mind checking whether reverting the changes in that PR fixes your problem? It seems that wrapping with `dask.delayed` will not raise an error related to opening, but it will surely swallow any other errors. So we should definitely restrict the errors here and find a way to make this work.
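The error-deferring behaviour described here can be sketched with a stdlib stand-in for `dask.delayed` (this is a rough illustration of the general pattern, not xarray's actual code path): submitting the failing "open" does not raise; the exception only surfaces when the result is requested, which is roughly why open-related errors get swallowed until compute time.

```python
from concurrent.futures import ThreadPoolExecutor

def open_file(path):
    # Hypothetical failing open, standing in for a backend open call.
    raise OSError(f"cannot open {path}")

with ThreadPoolExecutor(max_workers=1) as pool:
    # Submitting does not raise, analogous to wrapping a call in dask.delayed:
    # the OSError is captured and only re-raised when the result is requested.
    future = pool.submit(open_file, "data.nc")
    try:
        future.result()  # the deferred error surfaces only here
        raised = False
    except OSError:
        raised = True

print(raised)  # True: the error appeared only at result time, not at submit time
```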
@kmuehlbauer, thanks. I removed the changes from that PR as suggested (on top of the latest xarray release), and that didn't do the trick.
The NetCDF files get opened fine by `open_mfdataset`, then preprocessed by a custom preprocess function (still within `open_mfdataset`), and then everything hangs for a long time. After 15 minutes I usually give up and kill the process, as it costs cloud money and is just not sustainable.
Here is the dask graph showing that all the workers are close to 0% usage.
@rabernat, I understand the requirement for an MCVE; however, this can often be a big bottleneck when lodging issues. My code has integration tests in place using moto, dask, xarray, etc. These tests pass with new versions of xarray, but they only deal with, for example, 2 small NetCDF files. My issue is a real-life example processing a lot of data, and in a way it's a fairly basic `open_mfdataset` call with remote NetCDF4 files.