
Accumulation of memory usage with large number of dense array queries

adamrboxall opened this issue 7 months ago · 4 comments

Hi, I'm hoping someone will be able to help us resolve a memory issue with TileDB-Py...

When running a large number of queries against a TileDB dense array, we are seeing an accumulation of memory usage. This can run into several GBs and eventually causes an out of memory error.

It seems that periodically re-instantiating the TileDB context object keeps memory usage under control. This suggests the memory is held by the context object and correctly released when the context object is garbage collected. Unfortunately, adding logic to periodically re-instantiate the context isn't a practical solution for our more complex scripts (e.g. multiple open arrays inside PyTorch Datasets, having to track the number of queries before re-instantiating the context, etc.), so we're hoping the underlying issue can be resolved.

This occurs even when sm.tile_cache_size is set to 0.

This may be related to #150 or #440.

Please see a reproducible example below.

Thanks everyone for your help!


Python version:

Python 3.10.12

Python environment:

numpy==2.2.5
packaging==25.0
psutil==7.0.0
tiledb==0.34.0

Create a test array:

import numpy as np
import tiledb
import os
import psutil
import datetime

x = np.ones(10000000)
ctx = tiledb.default_ctx({"sm.tile_cache_size": 0, "sm.io_concurrency_level": 1, "sm.compute_concurrency_level": 1})
path = 'test_tile_db'
d1 = tiledb.Dim(
    'test_domain', domain=(0, x.shape[0] - 1), tile=10000, dtype="uint32"
)
domain = tiledb.Domain(d1)
v = tiledb.Attr(
    'test_value',
    dtype="float32",
)
schema = tiledb.ArraySchema(
    domain=domain, attrs=(v,), cell_order="row-major", tile_order="row-major"
)
tiledb.DenseArray.create(path, schema)  # create() returns None; the array is opened for writing below
values = x.astype(np.float32)
with tiledb.DenseArray(path, mode="w", ctx=ctx) as A:
    A[:] = {'test_value': values}

Run a large number of queries and track memory usage:

ctx = tiledb.Ctx({"sm.tile_cache_size": 0, "sm.io_concurrency_level": 1, "sm.compute_concurrency_level": 1})
data = tiledb.open(path, mode='r', ctx=ctx)

for i in range(100000):
    array = data[0]
    if i % 10000 == 0:
        process = psutil.Process(os.getpid())
        ram_usage = process.memory_info().rss / 1e6
        print(datetime.datetime.now(), ram_usage, 'MB', 'after', i, 'queries')

2025-05-15 10:37:23.794463 157.769728 MB after 0 queries
2025-05-15 10:37:38.979566 283.578368 MB after 10000 queries
2025-05-15 10:37:52.696840 413.016064 MB after 20000 queries
2025-05-15 10:38:06.442493 542.4128 MB after 30000 queries
2025-05-15 10:38:19.806429 671.66208 MB after 40000 queries
2025-05-15 10:38:33.460704 801.017856 MB after 50000 queries
2025-05-15 10:38:47.761755 930.164736 MB after 60000 queries
2025-05-15 10:39:02.624914 1059.328 MB after 70000 queries
2025-05-15 10:39:17.331571 1189.036032 MB after 80000 queries
2025-05-15 10:39:33.014144 1318.25664 MB after 90000 queries
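
For what it's worth, the growth looks strikingly linear; a quick back-of-the-envelope estimate from the first and last RSS samples above:

```python
# Back-of-the-envelope leak rate from the first and last RSS samples above.
start_queries, start_mb = 0, 157.769728
end_queries, end_mb = 90000, 1318.25664
rate_kb_per_query = (end_mb - start_mb) * 1000 / (end_queries - start_queries)
print(f"~{rate_kb_per_query:.1f} KB of RSS growth per query")  # ~12.9 KB per query
```

That's roughly 12.9 KB per query — a fraction of one 40 KB tile (10,000 float32 cells), but it adds up over hundreds of thousands of queries.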

Controlled memory usage when periodically re-instantiating the context object:

ctx = tiledb.Ctx({"sm.tile_cache_size": 0, "sm.io_concurrency_level": 1, "sm.compute_concurrency_level": 1})
data = tiledb.open(path, mode='r', ctx=ctx)

for i in range(100000):
    array = data[0]

    if i % 10000 == 0:
        ctx = tiledb.Ctx({"sm.tile_cache_size": 0, "sm.io_concurrency_level": 1, "sm.compute_concurrency_level": 1})
        data = tiledb.open(path, mode='r', ctx=ctx)
        process = psutil.Process(os.getpid())
        ram_usage = process.memory_info().rss / 1e6
        print(datetime.datetime.now(), ram_usage, 'MB', 'after', i, 'queries')

2025-05-15 10:41:44.234509 161.562624 MB after 0 queries
2025-05-15 10:41:57.731925 290.267136 MB after 10000 queries
2025-05-15 10:42:10.923391 290.267136 MB after 20000 queries
2025-05-15 10:42:24.230450 290.312192 MB after 30000 queries
2025-05-15 10:42:37.653962 290.033664 MB after 40000 queries
2025-05-15 10:42:51.223860 284.045312 MB after 50000 queries
2025-05-15 10:43:04.061728 285.372416 MB after 60000 queries
2025-05-15 10:43:17.623785 284.352512 MB after 70000 queries
2025-05-15 10:43:31.396821 283.860992 MB after 80000 queries
2025-05-15 10:43:45.231651 284.598272 MB after 90000 queries

adamrboxall avatar May 15 '25 11:05 adamrboxall

Hi @adamrboxall, could you please take a look at this comment and see if that config setting / explanation clarifies the situation?

ihnorton avatar May 28 '25 03:05 ihnorton

Thanks very much @ihnorton for your help! From this comment I understand sm.mem.malloc_trim should be enabled by default? I've just tried manually setting this with ctx = tiledb.Ctx({"sm.mem.malloc_trim": True}) and still see the same behaviour. Reading https://github.com/TileDB-Inc/TileDB/pull/2443#issue-971181062, am I correct in understanding malloc_trim would only be called on context destruction?

adamrboxall avatar Jun 02 '25 11:06 adamrboxall

am I correct in understanding malloc_trim would only be called on context destruction?

It is also called on Query destruction (https://github.com/TileDB-Inc/TileDB/blob/841dcbbeca64fd42f2f0b406e094c448445a3334/tiledb/sm/storage_manager/storage_manager.cc#L79).
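
If it's useful for experimentation, malloc_trim can also be invoked manually between batches of queries — a minimal sketch via ctypes, assuming a Linux/glibc environment (this is a generic glibc call, not a TileDB API):

```python
import ctypes

# glibc-only: malloc_trim(0) asks the allocator to return free heap pages
# to the OS. Returns 1 if any memory was released, 0 otherwise.
try:
    libc = ctypes.CDLL("libc.so.6")
    released = libc.malloc_trim(0)
except OSError:
    released = None  # not a glibc platform (e.g. macOS, musl)
print("malloc_trim result:", released)
```

This can help distinguish allocator retention (RSS drops after the call) from memory genuinely still referenced by the library (RSS unchanged).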

ihnorton avatar Jun 13 '25 19:06 ihnorton

OK, thanks @ihnorton. With this in mind, what do you think may still be leading to this accumulating memory usage? Or are there any workarounds you could recommend that would avoid high memory usage?

As far as I understand, in the code above a query object should be created and destroyed implicitly, and therefore malloc_trim should be called and memory associated with the query object should be freed. Is it possible memory associated with the query object is not being freed, or that accumulating memory is associated with the context rather than the query? Perhaps I'm misunderstanding!

I see comparable memory usage patterns (i.e. accumulating usage with a large number of queries) regardless of whether I set "sm.mem.malloc_trim": True or "sm.mem.malloc_trim": False in the configuration. I also see this issue when running the code in https://github.com/TileDB-Inc/TileDB-Py/issues/859#issuecomment-2576008714, but your comment there suggests this should have been fixed? If the fix was recent, could this be a version issue with the TileDB wheel from PyPI? (in the above example I am using 0.34.0)
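
For reference, this is how I'm checking the installed wheel version — a small stdlib-only sketch (if I understand the API correctly, tiledb.libtiledb.version() would additionally report the embedded core version, but I haven't relied on that here):

```python
import importlib.metadata

# Version of the installed tiledb-py wheel; PyPI wheels bundle a fixed
# libtiledb core, so an older wheel can predate a given core fix.
try:
    wheel_version = importlib.metadata.version("tiledb")
except importlib.metadata.PackageNotFoundError:
    wheel_version = None
print("tiledb-py wheel:", wheel_version)
```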

adamrboxall avatar Jun 23 '25 11:06 adamrboxall