zarr-python
Memory leak when saving in parallel
Zarr version: v2.12.0
Numcodecs version: v0.10.2
Python Version: 3.9.13
Operating System: Linux
Installation: using pip (in a conda environment)
Description
Hi Zarr team!
We use the ProcessPoolExecutor to distribute writing to different Zarr chunks over different jobs (see here).
The function that each job runs simply loads a buffer and saves it to the appropriate zarr array location. We noticed that the buffer that is saved to zarr (traces) is not properly garbage collected, which makes the RAM usage grow as the process continues. We patched it on our side by forcing a garbage collection in place.
We think this issue is related to Zarr because we have the exact same mechanism for writing to binary files, and in that case the RAM usage is what we expect.
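For context, here is a minimal sketch of the pattern, including the forced garbage collection we added as a workaround. It is simplified: the path, dataset name, chunk size, and the random data are placeholders, and the real code lives in spikeinterface.core.core_tools.

```python
import gc
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import zarr

ZARR_PATH = "test_ram.zarr"   # placeholder path
NUM_CHANNELS = 64
CHUNK_FRAMES = 30_000         # placeholder chunk size (frames)
NUM_CHUNKS = 20

def _write_chunk(chunk_index):
    # worker: load a buffer and write it to a non-overlapping block of the dataset
    start = chunk_index * CHUNK_FRAMES
    end = start + CHUNK_FRAMES
    dataset = zarr.open(ZARR_PATH, mode="r+")["traces_seg0"]
    # stand-in for the real "load a buffer" step
    traces = np.random.randn(CHUNK_FRAMES, NUM_CHANNELS).astype("float32")
    dataset[start:end, :] = traces
    # workaround we applied: force collection so the buffer is released
    del traces
    gc.collect()

if __name__ == "__main__":
    root = zarr.open(ZARR_PATH, mode="w")
    root.create_dataset(
        "traces_seg0",
        shape=(NUM_CHUNKS * CHUNK_FRAMES, NUM_CHANNELS),
        chunks=(CHUNK_FRAMES, NUM_CHANNELS),
        dtype="float32",
    )
    with ProcessPoolExecutor(max_workers=4) as executor:
        list(executor.map(_write_chunk, range(NUM_CHUNKS)))
```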
Steps to reproduce
The problem can be reproduced using SpikeInterface v0.94.0:
```
pip install spikeinterface[full]==0.94.0
```
Here is a sample script to reproduce the issue:
```python
import spikeinterface.full as si

# generate a sample recording
recording, _ = si.toy_example(num_channels=64, duration=600, num_segments=1)

# save it to zarr with parallelization (by default it will use blosc-zstd)
recording_zarr = recording.save(format="zarr", zarr_path="test_ram.zarr",
                                n_jobs=4, total_memory="500M",
                                progress_bar=True)
```
Note that the chunk size is adjusted so that the number of jobs times the memory needed by each chunk is ~500MB (total_memory).
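As a rough illustration of that sizing (assuming float32 traces; the exact byte size depends on the recording dtype):

```python
n_jobs = 4
total_memory = 500e6          # bytes, from total_memory="500M"
num_channels = 64
bytes_per_sample = 4          # assuming float32

memory_per_chunk = total_memory / n_jobs                           # ~125 MB per job
chunk_frames = int(memory_per_chunk // (num_channels * bytes_per_sample))
print(chunk_frames)           # ~488,000 frames per chunk
```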
Additional output: No response
Thanks for the write up, @alejoe91. I don't see anything immediately surprising in https://github.com/SpikeInterface/spikeinterface/blob/master/spikeinterface/core/core_tools.py#L635-L709 which definitely makes me worry. Could you help us understand what zarr-level calls are being made in a typical run?
@joshmoore sure, here is how the zarr calls are distributed:
- The `save()` function is routed to the `_save()` function, which is specific to Zarr. Here the zarr file is created and several groups and small datasets are added to it.
- The `write_traces_to_zarr` function does the actual writing (see the sketch below):
  - we create the large datasets, making sure that their chunk size matches the chunk size used by the parallel processing (so we write to non-overlapping blocks)
  - the parallel processing is carried out by the `ChunkRecordingExecutor` class, which internally uses the built-in `ProcessPoolExecutor`
  - each job is initialized with an `_init_func`, which allows us to do operations that are only required once (e.g., we reopen the Zarr object and store the datasets that need writing in a `context`)
  - the `_write_zarr_chunk` function is then run for each chunk: it retrieves the data that need to be written and writes them to the zarr dataset
I hope this makes it clearer!
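In code, the flow is roughly the following. This is only a simplified sketch: the function names follow the description above, while the path, dataset name, chunk layout, and zero-filled traces are placeholders (the real logic lives in `ChunkRecordingExecutor`):

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import zarr

# per-process context, filled once by the initializer
_worker_ctx = {}

def _init_func(zarr_path, dataset_name):
    # runs once per worker process: reopen the Zarr group and cache the dataset
    root = zarr.open(zarr_path, mode="r+")
    _worker_ctx["dataset"] = root[dataset_name]

def _write_zarr_chunk(frame_slice):
    # runs for each chunk: retrieve the data for this block and write it
    start, end = frame_slice
    dataset = _worker_ctx["dataset"]
    traces = np.zeros((end - start, dataset.shape[1]), dtype=dataset.dtype)  # stand-in for the real traces
    dataset[start:end, :] = traces

if __name__ == "__main__":
    chunk_frames = 30_000
    root = zarr.open("test_ram.zarr", mode="a")
    root.require_dataset("traces_seg0", shape=(10 * chunk_frames, 64),
                         chunks=(chunk_frames, 64), dtype="float32")
    chunks = [(i * chunk_frames, (i + 1) * chunk_frames) for i in range(10)]
    with ProcessPoolExecutor(max_workers=4,
                             initializer=_init_func,
                             initargs=("test_ram.zarr", "traces_seg0")) as executor:
        list(executor.map(_write_zarr_chunk, chunks))
```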
Thanks for the explanation, @alejoe91. Unfortunately, I don't think I'm going to be able to get to a reproducible example from your description. Would it be possible to extract the Zarr code or to dump your process's memory after a few iterations so we can pinpoint what's leaking?
@joshmoore I'll try to print out RAM usage with and without forcing GC. I'm a bit busy these couple of days. Planning to do it early next week.
Hi @joshmoore
Sorry for the delay in getting back to you.
I tested on my local machine and it seems that RAM usage is under control and it doesn't grow as reported here.
I initially encountered the issue using a cloud resource from GCP, so the environment might be the culprit. I'll repeat the test there to see if that architecture is triggering the abnormal RAM consumption.
Here is a log which prints the start_frame, end_frame, and RAM usage at each iteration (using 4 jobs, 1s chunk size). I'll provide the same log when running on GCP in a few days.
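The logging is along these lines (a psutil-based sketch, not the exact code):

```python
import os
import psutil

_process = psutil.Process(os.getpid())

def log_ram(start_frame, end_frame):
    # print the resident set size of the current process at each iteration
    rss_mb = _process.memory_info().rss / 1e6
    print(f"start_frame={start_frame} end_frame={end_frame} RSS={rss_mb:.1f} MB")
```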
@alejoe91 when you encountered the initial issue, were you storing the Zarr on GCS?
If so, I think I'm running into the same thing... In my case, data I've written to a Zarr group that uses gcsfs isn't garbage collected.
cc @martindurant
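For concreteness, my setup is roughly the following (bucket path, array shape, and chunking are placeholders):

```python
import gcsfs
import numpy as np
import zarr

fs = gcsfs.GCSFileSystem()
store = fs.get_mapper("my-bucket/test.zarr")   # placeholder bucket/path
root = zarr.open_group(store, mode="a")
arr = root.require_dataset("data", shape=(100, 1000, 1000),
                           chunks=(1, 1000, 1000), dtype="float32")

for i in range(arr.shape[0]):
    # the buffer written here appears to stay referenced instead of being freed
    arr[i] = np.random.randn(1000, 1000).astype("float32")
```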
Sounds like someone needs to run pympler? I don't have any immediate ideas why gcsfs should be holding on to references.
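For example, pympler's SummaryTracker can be wrapped around a few writes:

```python
from pympler import tracker

tr = tracker.SummaryTracker()
# ... perform a handful of zarr writes here ...
tr.print_diff()   # summarizes which object types (and how many bytes) have grown
```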