memory bug with Dataset starting in version 2025.3.0 in combination with dask
What happened?
Hi, I think I found a memory bug that appears when using xarray 2025.3.0 or later with dask (any version) installed. The memory of the very first Dataset ever created is never released. All Datasets created after that are released correctly, so a workaround for me is to initialize a small throwaway Dataset at the beginning of the code.
What did you expect to happen?
A deleted Dataset should release its memory, as it does in xarray version 2025.1.2 and older.
Minimal Complete Verifiable Example
import xarray as xr
import numpy as np

# Defining a tiny Dataset here would mitigate the problem, as only the very
# first Dataset is never released from memory:
# xr.Dataset({}, coords={"a": [1]})

def dummy(n):
    # ds holds an n x n float64 array (~5 GB for n=25000); it goes out of
    # scope when the function returns, so its memory should be released
    ds = xr.Dataset(
        {"A": (["x", "y"], np.random.randn(n, n))},
        coords={
            "x": range(n),
            "y": range(n),
        },
    )

dummy(25000)
input("Check your memory usage now... ~4GB is not released")
# Dockerfile to reproduce:
# FROM python:3.11.5-slim-bullseye
# RUN pip install xarray==2025.4.0 dask==2025.5.1
# Using xarray==2025.1.2 or lower shows correct behavior
MVCE confirmation
- [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [ ] Complete example — the example is self-contained, including all data and the text of any traceback.
- [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- [x] New issue — a search of GitHub Issues suggests this is not a duplicate.
- [ ] Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
Using docker stats to monitor memory usage (while the script waits at the input prompt), with xarray 2025.4.0:
MEM USAGE: 4.728GiB
With xarray 2025.1.2:
MEM USAGE: 82.77MiB
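As a cross-check that does not rely on docker stats, the resident set size can also be read from inside the process. A minimal sketch (assuming a Linux environment such as the slim-bullseye image above; rss_mib is just an illustrative helper, not part of the original report, and dummy() is the function from the MVCE):

def rss_mib():
    # Parse VmRSS from /proc/self/status (Linux-only); the value is given in kB.
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024
    return float("nan")

print(f"RSS before: {rss_mib():.0f} MiB")
dummy(25000)  # dummy() as defined in the MVCE above
print(f"RSS after:  {rss_mib():.0f} MiB")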
Anything else we need to know?
Initializing a tiny Dataset at the top of the code mitigates the problem: xr.Dataset({}, coords={"a": [1]})
Funnily enough, even calling xr.show_versions() first does the trick.
Feels like the very first call to Dataset leaves a reference somewhere, so it is not picked up by the garbage collector.
Might be related to https://github.com/pydata/xarray/issues/9807, but here we have a much simpler minimal example.
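One way to probe the lingering-reference hypothesis above would be to scan the objects the garbage collector still knows about after the call returns. A rough diagnostic sketch (it only uses the stdlib gc module and the dummy() function from the MVCE, and its output needs manual inspection):

import gc
import xarray as xr

dummy(25000)
gc.collect()

# Any Dataset still alive at this point must be held by something; print what
# refers to it. The list returned by gc.get_objects() itself also shows up as
# a (harmless) referrer.
for obj in gc.get_objects():
    if isinstance(obj, xr.Dataset):
        print("still alive:", dict(obj.sizes))
        for ref in gc.get_referrers(obj):
            print("  referred to by:", type(ref))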
Environment
xarray: 2025.4.0
pandas: 2.2.3
numpy: 2.2.6
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
zarr: None
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: 2025.5.1
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2025.5.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.5.1
pip: 23.2.1
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None
This is probably just that the garbage collector hasn't run. Please use an explicit import gc; gc.collect() to verify that memory isn't leaked.
I tested that gc.collect() has no effect here.
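A sketch of what such a verification looks like (reusing dummy() from the MVCE above); the container's memory usage stays at ~4.7 GiB regardless:

import gc

dummy(25000)
unreachable = gc.collect()  # force a full collection of all generations
print("unreachable objects found:", unreachable)
input("Memory usage is unchanged even after gc.collect()")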