
Memory Leak in Sharded Zarr Indexing

Open rm1113 opened this issue 6 months ago • 2 comments

Zarr version

v3.0.8

Numcodecs version

v0.16.1

Python Version

3.12

Operating System

Linux (WSL2)

Installation

pip

Description

The RAM consumption when reading a local Zarr array created with the shards option depends on the indexing method: selecting the same data via a slice or via a list of indices yields dramatically different memory usage. If I omit the shards parameter when calling create_array, memory usage for the two selection methods is similar.
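For context, the two selection styles request exactly the same rows, which is why the memory gap is surprising. A minimal NumPy stand-in (not zarr itself) showing the selections are interchangeable in terms of the result:

```python
import numpy as np

a = np.arange(20).reshape(4, 5)
x = a[slice(0, 2)]           # basic (slice) indexing: a view in NumPy
y = a[list(range(2))]        # fancy (list) indexing: always a copy
print(np.array_equal(x, y))  # → True
```

In zarr both forms materialize the data into a fresh array, so only the intermediate bookkeeping should differ between them.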

[screenshots: RAM usage over time for slice indexing vs. list indexing]

Steps to reproduce


import zarr
import numpy as np

N_ROWS = 10_000
N_COLS = 2_000
CHUNK_ROW = 5_000
CHUNK_COL = 100
ARRAY_PATH = "/tmp/tmp.zarr"

# create the array with sharding
rng = np.random.default_rng(seed=42)
x = rng.random((N_ROWS, N_COLS))

array = zarr.create_array(
    store=ARRAY_PATH,
    data=x,
    chunks=(CHUNK_ROW, CHUNK_COL),
    shards=(CHUNK_ROW, 500),
    overwrite=True,
)

# read data
array = zarr.open_array(ARRAY_PATH)

x = array[slice(0, CHUNK_ROW)]    # RAM consumption 50 MiB 
y = array[list(range(CHUNK_ROW))] # RAM consumption 800 MiB 
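A sketch of how the peak-RAM figures above can be measured, assuming Linux, where getrusage reports ru_maxrss in KiB. Since ru_maxrss is a process-wide high-water mark, each of the two indexing calls should be run in its own fresh process:

```python
import resource

def peak_rss_mib() -> float:
    # On Linux, ru_maxrss is the peak resident set size of the process in KiB.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# Illustration: the high-water mark only ever grows.
before = peak_rss_mib()
buf = bytearray(50 * 1024 * 1024)  # allocate and zero ~50 MiB
after = peak_rss_mib()
```

To reproduce the 50 MiB vs. 800 MiB numbers, run a script that performs only one of the two reads and prints peak_rss_mib() at exit, once per indexing method.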

Additional output

No response

rm1113 avatar Jun 23 '25 12:06 rm1113

I've found that RAM usage depends on concurrency settings:

[plot: peak RAM usage vs. concurrency setting]

Maximum RAM usage stops growing beyond a concurrency of 4, because there are only 4 shards to read.

So far, my assumption is that each concurrent task creates its own CoordinateIndexer instance, which is quite heavy when a large number of indices needs to be selected.
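One way to probe this hypothesis is to cap zarr's concurrency at runtime. This is a sketch assuming zarr-python 3's donfig-based configuration, where the relevant key is async.concurrency (check the docs for your version):

```python
import zarr

# Cap the number of concurrent chunk/shard read tasks. If each task holds
# its own indexer state, fewer tasks should mean a lower peak RAM footprint.
zarr.config.set({"async.concurrency": 4})
```

This is a process-wide setting, so it affects all subsequent zarr reads and writes in the session.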

rm1113 avatar Jun 25 '25 14:06 rm1113

Thanks for this detective work @rm1113! I'm not very familiar with the array indexing side of things, but maybe @normanrz or @dcherian have some ideas here.

d-v-b avatar Jun 26 '25 08:06 d-v-b