Uncontrolled memory growth in custom backend
What happened?
I wrote a custom backend. I'm using it to open a file, operate on the data, remove most of it from the `Dataset` using `.isel`, open the next file, concatenate, and repeat. I noticed that the memory used by the system grew significantly over time even though the size of the `Dataset` stayed approximately the same. I was able to reproduce the problem without most of this complexity.
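For concreteness, the real workflow looks roughly like this sketch (the file names and the "my_engine" entry point are placeholders, not the actual backend):

import xarray as xr

combined = None
for path in ["file0.bin", "file1.bin", "file2.bin"]:  # hypothetical inputs
    ds = xr.open_dataset(path, engine="my_engine")  # hypothetical custom engine
    ds = ds.isel(index=slice(0, 100))  # keep only a small slice of the data
    combined = ds if combined is None else xr.concat([combined, ds], dim="index")
    del ds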
I repeatedly created a dummy `Dataset` with 25 `Variable`s and observed the object counts with objgraph after each creation. I see the number of `Variable` instances continually increasing, even though I have `del`'d the `Dataset` after creating it. I think this suggests that something in xarray is not releasing the `Dataset`.
Variable 75 +25
ReferenceType 8511 +4
lock 46 +4
dict 21283 +2
KeyedRef 22 +2
SerializableLock 9 +2
list 23256 +1
set 1873 +1
method 1893 +1
SeedSequence 4 +1
---
Variable 100 +25
ReferenceType 8515 +4
lock 50 +4
dict 21285 +2
KeyedRef 24 +2
SerializableLock 11 +2
list 23257 +1
set 1874 +1
method 1894 +1
SeedSequence 5 +1
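One way to double-check that these are genuinely live objects, rather than garbage merely awaiting collection, is to force a collection pass and count instances directly (gc is the standard library module; objgraph.count is part of objgraph's public API):

import gc

import objgraph

gc.collect()  # collect anything that is unreachable but not yet freed
print(objgraph.count('Variable'))  # live xarray Variable instances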
I picked a random `Variable` that was not released and printed its back-reference chain graph.
import random

import objgraph

objgraph.show_chain(
    objgraph.find_backref_chain(
        random.choice(objgraph.by_type('Variable')),
        objgraph.is_proper_module,
    ),
    filename='graph.png',
)
What did you expect to happen?
I expected the memory used for the `Dataset` to be released and garbage-collected, and the memory in use to plateau instead of growing.
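As a process-level complement to the per-type object counts, peak memory use can be sampled with the standard library; note that ru_maxrss is reported in kilobytes on Linux (bytes on macOS):

import resource

# Peak resident set size of the process so far (kilobytes on Linux)
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kb / 1024:.0f} MB")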
Minimal Complete Verifiable Example
import os

os.system("mamba install --yes objgraph")

import io

import numpy as np
import xarray as xr
from xarray.backends import BackendArray, BackendEntrypoint
from xarray.backends.common import AbstractDataStore
from xarray.backends.file_manager import CachingFileManager
from xarray.backends.locks import SerializableLock, ensure_lock
from xarray.backends.store import StoreBackendEntrypoint
from xarray.core.indexing import (
    IndexingSupport,
    explicit_indexing_adapter,
)
from xarray.core.utils import Frozen, close_on_error
# This backend creates a random 500 MB variable and stores it into a Dataset 25 times
class memtest_DataStore(AbstractDataStore):
    """Store for binary data"""

    def __init__(
        self,
        filename,
        lock=None,
        num_bytes=500_000_000,
        **kwargs,
    ):
        if lock is None:
            lock = SerializableLock()
        self.lock = ensure_lock(lock)
        # Create a lock for the file manager
        self._fm_lock = SerializableLock()
        # Note: the opener passed here is a bound method of this store
        self._manager = CachingFileManager(
            self._open_file,
            filename,
            lock=self._fm_lock,
            kwargs=kwargs,
        )
        self.num_bytes = num_bytes
        self._num_vars = 25
        self._rng = np.random.default_rng()
        self._xr_obj = None

    def _open_file(self, filename, **kwargs):
        # Stand-in for a real file: random bytes in an in-memory buffer
        b = self._rng.random(self.num_bytes // 8, dtype=np.float64)
        file_obj = io.BytesIO(b, **kwargs)
        return file_obj

    def _create_dataset(self):
        with self._manager.acquire_context(needs_lock=True) as file_obj:
            file_obj.seek(0)
            fltarr = np.frombuffer(file_obj.getvalue(), np.float64)
            xr_obj = xr.Dataset()
            # Create coordinates
            xr_obj["index"] = (
                ("index",),
                range(len(fltarr)),
            )
            # Create several variables so we can see the change
            for i in range(self._num_vars):
                xr_obj[f"var{i}"] = (
                    ("index",),
                    fltarr,
                )
        return xr_obj

    @property
    def ds(self):
        if self._xr_obj is None:
            self._xr_obj = self._create_dataset()
        return self._xr_obj

    def open_store_variable(self, name, var):
        return self.ds[name].variable

    def get_variables(self):
        return Frozen(self.ds.variables)

    def get_attrs(self):
        return Frozen(self.ds.attrs)

    def get_dimensions(self):
        return Frozen(self.ds.dims)

    def close(self):
        self._manager.close()
class memtest_backend(BackendEntrypoint):
    description = "Memory test backend"

    def open_dataset(
        self,
        filename_or_obj,
        *,
        drop_variables=None,
        lock=None,
        num_bytes=500_000_000,
    ):
        store = memtest_DataStore(
            filename_or_obj,
            lock=lock,
            num_bytes=num_bytes,
        )
        store_entrypoint = StoreBackendEntrypoint()
        with close_on_error(store):
            ds = store_entrypoint.open_dataset(
                store,
                drop_variables=drop_variables,
            )
        return ds
# Loop to test memory growth
# This results in about 10 GB of additional memory used by the end of the process
import objgraph

objgraph.show_growth(limit=1)
for i in range(10):
    # i serves as a dummy (unique) filename for the backend
    ds = xr.open_dataset(i, engine=memtest_backend)
    # Just to test that something's still holding on to the data
    del ds
    objgraph.show_growth()
    print('---')
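For comparison, here is a variant of the loop that closes each dataset deterministically before dropping it; Dataset supports the context-manager protocol and calls close() on exit. This is a diagnostic sketch to separate "close() is never called" from a genuine reference leak, not a known fix:

for i in range(10):
    with xr.open_dataset(i, engine=memtest_backend) as ds:
        pass  # operate on ds here; close() runs when the block exits
    objgraph.show_growth()
    print('---')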
MVCE confirmation
- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
- [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
function 18610 +18610
list 15403 +12798
function 25617 +7007
dict 13365 +6005
tuple 13360 +5284
CompositeUnit 2954 +2954
PrefixUnit 1526 +1526
ReferenceType 4284 +1181
cell 4966 +1112
type 2397 +792
getset_descriptor 3374 +628
---
Variable 50 +25
ReferenceType 4298 +14
builtin_function_or_method 2537 +10
lock 34 +4
dict 13367 +2
KeyedRef 19 +2
SerializableLock 7 +2
list 15404 +1
set 952 +1
method 369 +1
---
Variable 75 +25
ReferenceType 4302 +4
lock 38 +4
dict 13369 +2
KeyedRef 21 +2
SerializableLock 9 +2
list 15405 +1
set 953 +1
method 370 +1
SeedSequence 4 +1
---
Variable 100 +25
ReferenceType 4306 +4
lock 42 +4
dict 13371 +2
KeyedRef 23 +2
SerializableLock 11 +2
list 15406 +1
set 954 +1
method 371 +1
SeedSequence 5 +1
---
Variable 125 +25
ReferenceType 4310 +4
lock 46 +4
dict 13373 +2
KeyedRef 25 +2
SerializableLock 13 +2
list 15407 +1
set 955 +1
method 372 +1
SeedSequence 6 +1
---
Variable 150 +25
ReferenceType 4314 +4
lock 50 +4
dict 13375 +2
KeyedRef 27 +2
SerializableLock 15 +2
list 15408 +1
set 956 +1
method 373 +1
SeedSequence 7 +1
---
Variable 175 +25
ReferenceType 4318 +4
lock 54 +4
dict 13377 +2
KeyedRef 29 +2
SerializableLock 17 +2
list 15409 +1
set 957 +1
method 374 +1
SeedSequence 8 +1
---
Variable 200 +25
ReferenceType 4322 +4
lock 58 +4
dict 13379 +2
KeyedRef 31 +2
SerializableLock 19 +2
list 15410 +1
set 958 +1
method 375 +1
SeedSequence 9 +1
---
Variable 225 +25
ReferenceType 4326 +4
lock 62 +4
dict 13381 +2
KeyedRef 33 +2
SerializableLock 21 +2
list 15411 +1
set 959 +1
method 376 +1
SeedSequence 10 +1
---
Variable 250 +25
ReferenceType 4330 +4
lock 66 +4
dict 13383 +2
KeyedRef 35 +2
SerializableLock 23 +2
list 15412 +1
set 960 +1
method 377 +1
SeedSequence 11 +1
---
Anything else we need to know?
This crashes the Binder notebook instance since it uses so much memory.
Environment
INSTALLED VERSIONS
------------------
commit: None
python: 3.12.6 | packaged by conda-forge | (main, Sep 30 2024, 18:08:52) [GCC 13.3.0]
python-bits: 64
OS: Linux
OS-release: 5.14.0-505.el9.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2024.9.0
pandas: 2.2.3
numpy: 1.26.4
scipy: None
netCDF4: 1.7.1
pydap: None
h5netcdf: 1.2.0
h5py: 3.11.0
zarr: None
cftime: 1.6.4
nc_time_axis: None
iris: None
bottleneck: None
dask: 2023.9.3
distributed: 2023.9.3
matplotlib: 3.9.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.9.2
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.2.2
pip: 23.2.1
conda: None
pytest: None
mypy: None
IPython: 8.16.1
sphinx: None