
memory bug with Dataset starting in version 2025.3.0 in combination with dask

Open Krenciszek opened this issue 6 months ago • 2 comments

What happened?

Hi, I think I found a memory bug that occurs with xarray 2025.3.0 and later whenever dask (any version) is also installed. The memory of the very first Dataset created is never released. All later Datasets are released correctly, so a workaround for me is in fact to create a small throwaway Dataset at the beginning of the code.

What did you expect to happen?

A deleted Dataset should release its memory, as it does in xarray version 2025.1.2 and older.

Minimal Complete Verifiable Example

import xarray as xr
import numpy as np

# Creating a tiny throwaway Dataset here mitigates the problem,
# since only the very first Dataset ever created is never released:
# xr.Dataset({}, coords={"a": [1]})

def dummy(n):
    # ds goes out of scope when the function returns,
    # so its memory should be freed
    ds = xr.Dataset(
        {"A": (["x", "y"], np.random.randn(n, n))},
        coords={"x": range(n), "y": range(n)},
    )


dummy(25000)
input("Check your memory usage now... ~4 GB is not released")

# Dockerfile to reproduce:
# FROM python:3.11.5-slim-bullseye
# RUN pip install xarray==2025.4.0 dask==2025.5.1
# Pinning xarray==2025.1.2 or lower instead shows the correct behavior
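To quantify the leak without eyeballing `docker stats`, the process RSS can be read directly. A minimal, Linux-only sketch (it parses `/proc/self/statm`; the helper name `current_rss_mib` is mine, not from the report):

```python
import os

def current_rss_mib():
    """Current resident set size in MiB (Linux-only, via /proc)."""
    with open("/proc/self/statm") as f:
        # second field of statm is the number of resident pages
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / 2**20
```

Used around the MVCE, e.g. printing `current_rss_mib()` before `dummy(25000)` and again after an explicit `gc.collect()`, this makes the stuck ~4 GB show up as an RSS value that never drops.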

MVCE confirmation

  • [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [ ] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [ ] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

Using `docker stats` to monitor memory usage (while waiting at the input prompt):
MEM USAGE: 4.728GiB

When using xarray version 2025.1.2:
MEM USAGE: 82.77MiB

Anything else we need to know?

Initializing a tiny Dataset at the top of the code mitigates the problem: xr.Dataset({}, coords={"a": [1]})

Funnily, even calling xr.show_versions() beforehand does the same.

It feels like the very first call to Dataset leaves a reference somewhere, so the data is never picked up by the garbage collector.
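One way to hunt for such a lingering reference is the standard-library tracemalloc module, which NumPy reports its buffer allocations to: after the Dataset is gone, the traceback attached to the largest surviving allocation points at whoever created it. A rough sketch (an assumption on my part, not a confirmed diagnosis; tracing must be started before the Dataset is created):

```python
import gc
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames per allocation

# ... run the MVCE here, e.g. dummy(25000) ...

gc.collect()
snapshot = tracemalloc.take_snapshot()

# The largest surviving allocation would be the stuck array buffer;
# its traceback shows the call site that allocated it.
for stat in snapshot.statistics("traceback")[:3]:
    print(f"{stat.size / 2**20:.1f} MiB in {stat.count} block(s)")
    for line in stat.traceback.format():
        print(line)
```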

Might be related to https://github.com/pydata/xarray/issues/9807, but here we have a much simpler minimal example.

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.5 (main, Sep 20 2023, 11:03:59) [GCC 10.2.1 20210110]
python-bits: 64
OS: Linux
OS-release: 5.15.167.4-microsoft-standard-WSL2
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2025.4.0
pandas: 2.2.3
numpy: 2.2.6
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
zarr: None
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: 2025.5.1
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2025.5.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.5.1
pip: 23.2.1
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None

Krenciszek avatar May 23 '25 18:05 Krenciszek

This is probably just that the garbage collector hasn't run yet. Please call it explicitly (`import gc; gc.collect()`) to verify that memory is actually leaked.

dcherian avatar May 28 '25 21:05 dcherian

I tested it; gc.collect() has no effect here.

Krenciszek avatar May 29 '25 09:05 Krenciszek
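Beyond watching RSS, a weakref probe gives a yes/no answer on whether a particular object was actually reclaimed. A minimal stdlib sketch (Payload and is_released are illustrative names of mine; for the MVCE one would instead take a weakref.ref to the backing ndarray before building the Dataset):

```python
import gc
import weakref

class Payload:
    """Stand-in for a large array buffer (weakref-able, unlike a plain list)."""

def is_released(make_obj):
    """Create an object, drop the only strong reference, force a GC
    pass, and report whether the object was actually reclaimed."""
    obj = make_obj()
    ref = weakref.ref(obj)
    del obj
    gc.collect()
    return ref() is None  # True: freed; False: something still holds it
```

Applied to the first Dataset's ndarray, this probe would (per the report above) return False on xarray >= 2025.3.0 with dask installed, confirming a live reference rather than delayed collection.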