Uncontrolled memory growth in custom backend
What happened?
I wrote a custom backend. I'm using it to open a file, operate on the data, remove most of it from the `Dataset` using `.isel`, open the next file, concatenate, and repeat. I noticed that the memory used by the system grew significantly over time even though the size of the `Dataset` stayed approximately the same. I was able to reproduce the problem without most of this complexity.
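For concreteness, the real workflow looks roughly like this sketch (the file names and the "my_engine" entry point are placeholders, not the actual backend):

import xarray as xr

combined = None
for path in ["file0.bin", "file1.bin", "file2.bin"]:  # hypothetical inputs
    ds = xr.open_dataset(path, engine="my_engine")  # hypothetical custom engine
    ds = ds.isel(index=slice(0, 100))  # keep only a small slice of the data
    combined = ds if combined is None else xr.concat([combined, ds], dim="index")
    del ds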
I repeatedly created a dummy `Dataset` with 25 `Variable`s and observed the object counts with objgraph after each creation. I see the number of `Variable` instances continually increasing, even though I have `del`'d the `Dataset` after creating it. I think this suggests that something in xarray is not releasing the `Dataset`.
Variable 75 +25
ReferenceType 8511 +4
lock 46 +4
dict 21283 +2
KeyedRef 22 +2
SerializableLock 9 +2
list 23256 +1
set 1873 +1
method 1893 +1
SeedSequence 4 +1
---
Variable 100 +25
ReferenceType 8515 +4
lock 50 +4
dict 21285 +2
KeyedRef 24 +2
SerializableLock 11 +2
list 23257 +1
set 1874 +1
method 1894 +1
SeedSequence 5 +1
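One way to double-check that these are genuinely live objects, rather than garbage merely awaiting collection, is to force a collection pass and count instances directly (gc is the standard library module; objgraph.count is part of objgraph's public API):

import gc

import objgraph

gc.collect()  # collect anything that is unreachable but not yet freed
print(objgraph.count('Variable'))  # live xarray Variable instances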
I picked a random `Variable` that was not released and printed its back-reference chain graph.
import random

import objgraph

objgraph.show_chain(
    objgraph.find_backref_chain(
        random.choice(objgraph.by_type('Variable')),
        objgraph.is_proper_module,
    ),
    filename='graph.png',
)
What did you expect to happen?
I expected the memory used for the `Dataset` to be released and garbage-collected, and the memory in use to plateau instead of growing.
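As a process-level complement to the per-type object counts, peak memory use can be sampled with the standard library; note that ru_maxrss is reported in kilobytes on Linux (bytes on macOS):

import resource

# Peak resident set size of the process so far (kilobytes on Linux)
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kb / 1024:.0f} MB")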
Minimal Complete Verifiable Example
import os

os.system("mamba install --yes objgraph")

import io

import numpy as np
import xarray as xr
from xarray.backends import BackendArray, BackendEntrypoint
from xarray.backends.common import AbstractDataStore
from xarray.backends.file_manager import CachingFileManager
from xarray.backends.locks import SerializableLock, ensure_lock
from xarray.backends.store import StoreBackendEntrypoint
from xarray.core.indexing import (
    IndexingSupport,
    explicit_indexing_adapter,
)
from xarray.core.utils import Frozen, close_on_error
# This backend creates a random 500 MB variable and stores it into a Dataset 25 times
class memtest_DataStore(AbstractDataStore):
    """Store for binary data"""

    def __init__(
        self,
        filename,
        lock=None,
        num_bytes=500_000_000,
        **kwargs,
    ):
        if lock is None:
            lock = SerializableLock()
        self.lock = ensure_lock(lock)
        # Create a lock for the file manager
        self._fm_lock = SerializableLock()
        # Note: the opener passed here is a bound method of this store
        self._manager = CachingFileManager(
            self._open_file,
            filename,
            lock=self._fm_lock,
            kwargs=kwargs,
        )
        self.num_bytes = num_bytes
        self._num_vars = 25
        self._rng = np.random.default_rng()
        self._xr_obj = None

    def _open_file(self, filename, **kwargs):
        # Stand-in for a real file: random bytes in an in-memory buffer
        b = self._rng.random(self.num_bytes // 8, dtype=np.float64)
        file_obj = io.BytesIO(b, **kwargs)
        return file_obj

    def _create_dataset(self):
        with self._manager.acquire_context(needs_lock=True) as file_obj:
            file_obj.seek(0)
            fltarr = np.frombuffer(file_obj.getvalue(), np.float64)
            xr_obj = xr.Dataset()
            # Create coordinates
            xr_obj["index"] = (
                ("index",),
                range(len(fltarr)),
            )
            # Create several variables so we can see the change
            for i in range(self._num_vars):
                xr_obj[f"var{i}"] = (
                    ("index",),
                    fltarr,
                )
        return xr_obj

    @property
    def ds(self):
        if self._xr_obj is None:
            self._xr_obj = self._create_dataset()
        return self._xr_obj

    def open_store_variable(self, name, var):
        return self.ds[name].variable

    def get_variables(self):
        return Frozen(self.ds.variables)

    def get_attrs(self):
        return Frozen(self.ds.attrs)

    def get_dimensions(self):
        return Frozen(self.ds.dims)

    def close(self):
        self._manager.close()
class memtest_backend(BackendEntrypoint):
    description = "Memory test backend"

    def open_dataset(
        self,
        filename_or_obj,
        *,
        drop_variables=None,
        lock=None,
        num_bytes=500_000_000,
    ):
        store = memtest_DataStore(
            filename_or_obj,
            lock=lock,
            num_bytes=num_bytes,
        )
        store_entrypoint = StoreBackendEntrypoint()
        with close_on_error(store):
            ds = store_entrypoint.open_dataset(
                store,
                drop_variables=drop_variables,
            )
        return ds
# Loop to test memory growth
# This results in about 10 GB of additional memory used by the end of the process
import objgraph

objgraph.show_growth(limit=1)
for i in range(10):
    # i serves as a dummy (unique) filename for the backend
    ds = xr.open_dataset(i, engine=memtest_backend)
    # Just to test that something's still holding on to the data
    del ds
    objgraph.show_growth()
    print('---')
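For comparison, here is a variant of the loop that closes each dataset deterministically before dropping it; Dataset supports the context-manager protocol and calls close() on exit. This is a diagnostic sketch to separate "close() is never called" from a genuine reference leak, not a known fix:

for i in range(10):
    with xr.open_dataset(i, engine=memtest_backend) as ds:
        pass  # operate on ds here; close() runs when the block exits
    objgraph.show_growth()
    print('---')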
MVCE confirmation
- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
- [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
function 18610 +18610
list 15403 +12798
function 25617 +7007
dict 13365 +6005
tuple 13360 +5284
CompositeUnit 2954 +2954
PrefixUnit 1526 +1526
ReferenceType 4284 +1181
cell 4966 +1112
type 2397 +792
getset_descriptor 3374 +628
---
Variable 50 +25
ReferenceType 4298 +14
builtin_function_or_method 2537 +10
lock 34 +4
dict 13367 +2
KeyedRef 19 +2
SerializableLock 7 +2
list 15404 +1
set 952 +1
method 369 +1
---
Variable 75 +25
ReferenceType 4302 +4
lock 38 +4
dict 13369 +2
KeyedRef 21 +2
SerializableLock 9 +2
list 15405 +1
set 953 +1
method 370 +1
SeedSequence 4 +1
---
Variable 100 +25
ReferenceType 4306 +4
lock 42 +4
dict 13371 +2
KeyedRef 23 +2
SerializableLock 11 +2
list 15406 +1
set 954 +1
method 371 +1
SeedSequence 5 +1
---
Variable 125 +25
ReferenceType 4310 +4
lock 46 +4
dict 13373 +2
KeyedRef 25 +2
SerializableLock 13 +2
list 15407 +1
set 955 +1
method 372 +1
SeedSequence 6 +1
---
Variable 150 +25
ReferenceType 4314 +4
lock 50 +4
dict 13375 +2
KeyedRef 27 +2
SerializableLock 15 +2
list 15408 +1
set 956 +1
method 373 +1
SeedSequence 7 +1
---
Variable 175 +25
ReferenceType 4318 +4
lock 54 +4
dict 13377 +2
KeyedRef 29 +2
SerializableLock 17 +2
list 15409 +1
set 957 +1
method 374 +1
SeedSequence 8 +1
---
Variable 200 +25
ReferenceType 4322 +4
lock 58 +4
dict 13379 +2
KeyedRef 31 +2
SerializableLock 19 +2
list 15410 +1
set 958 +1
method 375 +1
SeedSequence 9 +1
---
Variable 225 +25
ReferenceType 4326 +4
lock 62 +4
dict 13381 +2
KeyedRef 33 +2
SerializableLock 21 +2
list 15411 +1
set 959 +1
method 376 +1
SeedSequence 10 +1
---
Variable 250 +25
ReferenceType 4330 +4
lock 66 +4
dict 13383 +2
KeyedRef 35 +2
SerializableLock 23 +2
list 15412 +1
set 960 +1
method 377 +1
SeedSequence 11 +1
---
Anything else we need to know?
This crashes the Binder notebook instance since it uses so much memory.
Environment
INSTALLED VERSIONS
------------------
commit: None
python: 3.12.6 | packaged by conda-forge | (main, Sep 30 2024, 18:08:52) [GCC 13.3.0]
python-bits: 64
OS: Linux
OS-release: 5.14.0-505.el9.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2024.9.0
pandas: 2.2.3
numpy: 1.26.4
scipy: None
netCDF4: 1.7.1
pydap: None
h5netcdf: 1.2.0
h5py: 3.11.0
zarr: None
cftime: 1.6.4
nc_time_axis: None
iris: None
bottleneck: None
dask: 2023.9.3
distributed: 2023.9.3
matplotlib: 3.9.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.9.2
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.2.2
pip: 23.2.1
conda: None
pytest: None
mypy: None
IPython: 8.16.1
sphinx: None