
`open_mfdataset` hangs indefinitely with h5netcdf and dask since commit ea9f02 -> v2025.9.1

Open lbesnard opened this issue 5 months ago • 4 comments

What happened?

Since https://github.com/pydata/xarray/commit/ea9f02bbe6d3b02fbb56600710b2792795e0e4a5 and #10571, the following snippet hangs indefinitely: the individual files are opened successfully, but `xr.open_mfdataset` never completes, and I end up having to kill the process. Reverting to the commit just before this change works fine.

open_mfdataset_params = {
    "engine": engine,
    "parallel": True,
    "preprocess": partial_preprocess,
    "data_vars": "all",
    "concat_characters": True,
    "mask_and_scale": True,
    "decode_cf": True,
    "decode_times": self.time_coder,
    "decode_coords": True,
    "compat": "override",
    "coords": "minimal",
    "drop_variables": drop_vars_list,
}

ds = xr.open_mfdataset(batch_files, **open_mfdataset_params)

Context

  • Engine: h5netcdf
  • Scheduler: Dask running on a Coiled cluster
  • Input: `batch_files` is a list of NetCDF file objects stored in S3 (`fileset = [s3_fs.open(file) for file in s3_paths]`)
  • preprocess runs successfully on all files, but open_mfdataset itself gets stuck
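For context, the fileset is built roughly as described above. Here is a minimal sketch of that pattern; `s3_paths` and the bucket layout are hypothetical, and fsspec's in-memory filesystem stands in for S3 so the sketch runs without credentials (in the real setup this would be `fsspec.filesystem("s3")` / `s3fs`):

```python
import fsspec

# In-memory filesystem as a stand-in for S3; no credentials needed.
fs = fsspec.filesystem("memory")

# Hypothetical paths; in the real case these point at NetCDF objects in S3.
s3_paths = ["memory://bucket/a.nc", "memory://bucket/b.nc"]
for p in s3_paths:
    with fs.open(p, "wb") as f:
        f.write(b"\x00")  # placeholder bytes, not a real NetCDF file

# The pattern from the report: a list of open file-like objects,
# later passed to xr.open_mfdataset(fileset, engine="h5netcdf", ...).
fileset = [fs.open(p, "rb") for p in s3_paths]
print(len(fileset), "file objects ready")
```

With real S3 data, each element of `fileset` is a seekable file-like object that the h5netcdf engine reads directly.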

What I’ve checked

  • I’m aware of https://github.com/pydata/xarray/issues/10712, and like @rabernat, that change broke my code. I actually caught it in my own integration tests, whereas the xarray unit tests didn’t expose it at the time. More generally, I think some of these regressions could be avoided if there were higher-level integration tests in addition to the existing unit tests. That particular issue was fixed about two weeks ago and resolved the pickle error I was seeing. However, I’m still encountering the hang I describe here.

  • Rolling back to the commit before https://github.com/pydata/xarray/commit/ea9f02bbe6d3b02fbb56600710b2792795e0e4a5 avoids the problem. In my case, 938e18680cbd9649f375d1544af8e5a16c352210 is the latest working commit.
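For anyone hitting the same regression, pinning to the last known-good commit (the hash quoted above) is a sketch like the following; it assumes pip, git, and network access:

```shell
# Pin xarray to the last commit reported to work in this issue.
pip install "xarray @ git+https://github.com/pydata/xarray.git@938e18680cbd9649f375d1544af8e5a16c352210"
```

From there, `git bisect start ea9f02bbe 938e18680` in a checkout of the repo would narrow the regression to a single commit if the range turns out to be wider than one change.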

What did you expect to happen?

xr.open_mfdataset should complete successfully, as it does on earlier commits.

Minimal Complete Verifiable Example

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "xarray[complete]@git+https://github.com/pydata/xarray.git@main",
# ]
# ///
#
# This script automatically imports the development branch of xarray to check for issues.
# Please delete this header if you have _not_ tested this script with `uv run`!

import xarray as xr
xr.show_versions()
# your reproducer code ...

Steps to reproduce

No response

MVCE confirmation

  • [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [ ] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [x] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output


Anything else we need to know?

No response

Environment

In [3]: xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.12.11 | packaged by conda-forge | (main, Jun 4 2025, 14:45:31) [GCC 13.3.0]
python-bits: 64
OS: Linux
OS-release: 6.8.0-39-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: en_IE.UTF-8
LOCALE: ('C', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development

xarray: 2025.7.2.dev31+gea9f02bbe
pandas: 2.3.3
numpy: 1.26.4
scipy: 1.16.2
netCDF4: 1.6.5
pydap: 3.5.7
h5netcdf: 1.6.4
h5py: 3.11.0
zarr: 2.18.7
cftime: 1.6.4.post1
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.6.0
dask: 2025.9.1
distributed: 2025.9.1
matplotlib: 3.10.6
cartopy: 0.25.0
seaborn: 0.13.2
numbagg: 0.9.3
fsspec: 2025.5.1
cupy: None
pint: None
sparse: 0.17.0
flox: 0.10.7
numpy_groupies: 0.11.3
setuptools: 80.9.0
pip: 25.2
conda: None
pytest: 8.4.2
mypy: None
IPython: 7.34.0
sphinx: 8.2.3

lbesnard avatar Oct 02 '25 04:10 lbesnard

Further to my first post, I realised the process doesn't actually hang; it just takes an excruciatingly long time doing "nothing" compared to the last working commit.

I looked at the obscure option `xr.set_options(use_new_combine_kwarg_defaults=True)` found in https://github.com/pydata/xarray/issues/1385#issuecomment-3144866423, but it hasn't helped.

For comparison, here is the Coiled dashboard with the latest 2025.9.1 release. The empty gap between the green and blue bars is where nothing happens while opening 50 NetCDF files; in my case it takes 7 minutes.

[screenshot: Coiled dashboard on xarray 2025.9.1, showing the idle gap]

The dask status page is now painfully slow. [screenshot: dask status page]

With the latest working xarray commit 938e186, I don't have this problem (see the screenshot below, on exactly the same sample of data as above), so something clearly changed recently. It didn't show up in my unit tests because the code is still technically working, just very, very slowly.

[screenshot: Coiled dashboard on commit 938e186, same data, no idle gap]

lbesnard avatar Oct 03 '25 02:10 lbesnard

I ended up creating a fork of xarray pinned to a June 2025 version, and manually applied the xarray bug fixes I needed in order to have xarray working as expected for my case (https://github.com/lbesnard/xarray).

Does this issue need more information?

lbesnard avatar Oct 21 '25 01:10 lbesnard

> Does this issue need more information?

What would help this issue is a minimal, complete, verifiable example (MCVE).

Without this, it will be impossible for Xarray developers to identify the root cause and fix this issue.

rabernat avatar Oct 21 '25 11:10 rabernat

@lbesnard I remember another change shortly after that which directly affects `open_mfdataset`: #9955.

https://github.com/pydata/xarray/blob/101a5c2116a7ca9b658b5def4cab9b5e2688d590/xarray/backends/api.py#L1630-L1650

Would you mind checking whether reverting the changes in that PR fixes your problem? It seems that wrapping with `dask.delayed` will not raise an error related to opening, but it will surely swallow any other errors. So we should definitely restrict the errors here and find a way to make this work.
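To illustrate the point about deferred execution with a stdlib-only analogue (this is not xarray's actual code, just the general pattern): once the open call is wrapped and deferred, submitting it raises nothing, and any exception only surfaces, or is silently swallowed, when the result is demanded:

```python
from concurrent.futures import ThreadPoolExecutor

def open_file(path):
    # Stand-in for a backend "open" that fails; hypothetical, not xarray code.
    raise OSError(f"cannot open {path}")

with ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(open_file, "data.nc")  # submitting raises nothing
    deferred = None
    try:
        fut.result()                         # the error only surfaces here
    except OSError as exc:
        deferred = exc

print("error deferred until result():", deferred)
```

If the code path that demands the result also catches broad exception types, the original error never reaches the user, which is why restricting the caught error types matters.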

kmuehlbauer avatar Oct 21 '25 12:10 kmuehlbauer

@kmuehlbauer, thanks. I removed the changes from that PR as suggested (on top of the latest xarray release), and that didn't do the trick. The NetCDF files are opened fine by `open_mfdataset`, then preprocessed by a custom `preprocess` function (still within `open_mfdataset`), and then everything hangs for a long time. After 15 minutes I usually give up and kill the process, as it costs cloud money and just isn't sustainable.

Here is the dask graph showing that all the workers are close to 0% usage. [screenshot: dask dashboard, workers idle]

@rabernat, I understand the requirement for an MCVE, but it can often be a big bottleneck when filing issues. My code has integration tests in place using moto, dask, and xarray. These tests pass with new versions of xarray, but they only deal with, for example, two small NetCDF files. My issue is a real-life example processing a lot of data, and in a way it's a fairly basic `open_mfdataset` call with remote NetCDF4 files.

lbesnard avatar Dec 17 '25 05:12 lbesnard