
Unexpected memory retention when reading slices of dataframes

Open rmlynx opened this issue 6 months ago • 3 comments

Describe the bug

When reading and slicing a subset of a large DataFrame:

  1. The entire DataFrame appears to be loaded into memory.
  2. A slice is taken and returned, likely as a view that retains a reference to the original full DataFrame.
  3. If this operation is repeated in a loop and each slice is stored (e.g., in a list), the original large DataFrames are never deallocated.

This causes cumulative memory usage to increase continuously, eventually leading to an out-of-memory crash.
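
If that is indeed what is happening, ordinary NumPy view semantics would explain it. A minimal sketch of the suspected mechanism, in plain NumPy rather than ArcticDB internals (this is our assumption about the cause, not a confirmed diagnosis):

import numpy as np

# A buffer roughly the size of the test frame used below (~0.38 GB of int64).
big = np.random.randint(0, 100, size=(5100, 10_000))

# Basic slicing returns a view: this 3-row slice keeps the full 0.38 GB alive.
small = big[1:4]
print(small.base is big)        # True -> the parent buffer is retained

# An explicit copy owns its own ~0.2 MB buffer, so the parent can be freed.
small_copy = big[1:4].copy()
print(small_copy.base is None)  # True -> independent allocation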

Steps/Code to Reproduce

1. Create demo data

import pandas as pd
import numpy as np
from datetime import datetime


def generate_random_dataframe(n_rows=25, n_cols=10, start_date="2000-01-01", freq="D"):
    """
    Generate a random DataFrame with a datetime index.

    Args:
        n_rows (int): Number of rows in the DataFrame.
        n_cols (int): Number of columns in the DataFrame.
        start_date (str): Start date for the datetime index.
        freq (str): Frequency for the datetime index (e.g., 'D' for daily, 'H' for hourly).

    Returns:
        pd.DataFrame: A random DataFrame with datetime as the index.
    """
    # Generate column names
    cols = [f"COL_{i}" for i in range(n_cols)]

    # Generate random data
    data = np.random.randint(0, 100, size=(n_rows, n_cols))

    # Create a datetime index
    index = pd.date_range(start=start_date, periods=n_rows, freq=freq)

    # Create the DataFrame
    df = pd.DataFrame(data, columns=cols, index=index)

    return df

Write a DataFrame of roughly 20 years of daily rows (255 × 20 = 5,100) by 10,000 columns of random data (about 0.38 GB as int64) to Arctic:

df = generate_random_dataframe(
    n_rows=255 * 20, n_cols=10_000, start_date="1990-01-01", freq="D"
)
df.head()
import arcticdb as adb

uri = "lmdb://tmp/arcticdb_leak"
ac = adb.Arctic(uri)

library = ac.get_library("demo_lib", create_if_missing=True)

library.write("test_frame", df)

2. Read Data

Helper function to get memory usage of a list of DataFrames:

def get_total_dataframe_size_gb(df_list):
    """
    Calculate total memory usage of a list of DataFrames in gigabytes.

    Parameters:
        df_list (list of pd.DataFrame): List of DataFrames.

    Returns:
        float: Total size in GB.
    """
    total_bytes = sum(df.memory_usage(deep=True).sum() for df in df_list)
    return total_bytes / (1024**3)
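
Note that memory_usage(deep=True) only reports each slice's own buffers, so it will not show memory retained by a hidden parent frame; the process RSS is the more telling number. A small helper for that, assuming psutil is available (it is not part of the original snippet):

import os

import psutil  # assumed extra dependency: pip install psutil


def get_process_rss_gb():
    """Resident set size of the current process in GB."""
    return psutil.Process(os.getpid()).memory_info().rss / (1024**3)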

Read the full dataframe for reference:

from_storage_df = library.read("test_frame").data

print(f"Shape of data: {from_storage_df.shape}")
print(f"Size of data: {get_total_dataframe_size_gb([from_storage_df]):.2f} GB")

Read only a slice:

n_rows_to_read = 3

from_date = from_storage_df.index[1]
to_date = from_storage_df.index[n_rows_to_read]

small_df = library.read(
    "test_frame",
    date_range=(from_date, to_date),
).data

print(f"Shape of fetched subset of data: {small_df.shape}")
print(
    f"Size of fetched subset of data: {get_total_dataframe_size_gb([small_df]):.4f} GB"
)

Now read the small slice in a loop and save results in a list:

retrieved_data = []
n_times_to_fetch = 100

for i in range(n_times_to_fetch):
    small_df = library.read(
        "test_frame",
        date_range=(
            from_date,
            to_date,
        ),
    ).data
    retrieved_data.append(small_df)

    if i % 10 == 0:
        print(f"Fetched small subset {i} times")
        print(
            f"    Total size of retrieved data so far: {get_total_dataframe_size_gb(retrieved_data):.2f} GB"
        )

print()
print(
    f"Total size of retrieved data: {get_total_dataframe_size_gb(retrieved_data):.2f} GB"
)
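
If the stored slices really do pin the parent buffers, the footprint should survive an explicit garbage collection. Forcing a collection and re-checking the process RSS (with the psutil helper above) rules out merely delayed deallocation:

import gc

gc.collect()  # rule out delayed deallocation
print(f"RSS after gc.collect(): {get_process_rss_gb():.2f} GB")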

Expected Results

Memory usage should increase by roughly 0.0002 GB × 100 = 0.02 GB in total. Instead, it increases by several hundred MB per iteration (until out of memory).

[Screenshot: memory usage during the loop]

OS, Python Version and ArcticDB Version

  • Linux
  • Python 3.10.12
  • ArcticDB 5.2.3

Backend storage used

LMDB

Additional Context

We're able to work around the problem by passing adb.QueryBuilder().date_range((from_date, to_date)) to library.read instead (sketched below), but it's not clear whether this is the intended way to do it, or why the most obvious way to read a slice of a DataFrame causes this memory leak.
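
For reference, a sketch of the workaround (query_builder is the library.read keyword we pass it through):

q = adb.QueryBuilder().date_range((from_date, to_date))

small_df = library.read("test_frame", query_builder=q).data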

rmlynx · May 09 '25 08:05