
Memory leak when saving in parallel

Open alejoe91 opened this issue 3 years ago • 5 comments

Zarr version

v2.12.0

Numcodecs version

v0.10.2

Python Version

3.9.13

Operating System

Linux

Installation

using pip (in a conda environment)

Description

Hi Zarr team!

We use a ProcessPoolExecutor to distribute the writing of different Zarr chunks across jobs (see here).

The function that each job runs simply loads a buffer and saves it to the appropriate Zarr array location. We noticed that the buffer saved to Zarr (traces) is not properly garbage collected, which makes RAM usage grow as the process continues. We patched it on our side by forcing a garbage collection in place.

We think this issue is related to Zarr because we use the exact same mechanism to write to binary files, and in that case the RAM usage is what we expect.
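
To make the pattern concrete, here is a minimal sketch (with made-up names and shapes, not the actual SpikeInterface code) of what each job does, including the gc.collect() workaround:

import gc
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import zarr

def write_chunk(args):
    # each job opens the existing zarr array and writes its chunk
    # (assumes test_ram.zarr and the "traces_seg0" dataset were created beforehand)
    zarr_path, start_frame, end_frame, num_channels = args
    traces_ds = zarr.open(zarr_path, mode="r+")["traces_seg0"]
    traces = np.random.randn(end_frame - start_frame, num_channels).astype("float32")  # stand-in for the loaded buffer
    traces_ds[start_frame:end_frame, :] = traces
    del traces
    gc.collect()  # workaround: without this, RSS keeps growing across chunks

if __name__ == "__main__":
    chunks = [("test_ram.zarr", i * 30_000, (i + 1) * 30_000, 64) for i in range(600)]
    with ProcessPoolExecutor(max_workers=4) as executor:
        list(executor.map(write_chunk, chunks))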

Steps to reproduce

The problem can be reproduced using SpikeInterface v0.94.0:

pip install spikeinterface[full]==0.94.0

Here is a sample script to reproduce the issue:

import spikeinterface.full as si

# generate a sample recording
recording, _ = si.toy_example(num_channels=64, duration=600, num_segments=1)

# save it to zarr with parallization (by default it will use blosc-zstd)
recording_zarr = recording.save(format="zarr", zarr_path="test_ram.zarr", n_jobs=4, total_memory="500M",
                                progress_bar=True)

Note that the chunk size is adjusted so that the number of jobs times the memory needed by each chunk is ~500MB (total_memory).
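
As a rough illustration (not the exact SpikeInterface logic), the per-job chunk size can be derived from total_memory like this:

num_channels = 64
dtype_size = 4                  # assuming float32 traces
n_jobs = 4
total_memory = 500e6            # "500M"

bytes_per_chunk = total_memory / n_jobs                               # ~125 MB per job
frames_per_chunk = int(bytes_per_chunk / (num_channels * dtype_size))
print(frames_per_chunk)         # ~488k frames per chunk in this example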

Additional output

No response

alejoe91 avatar Sep 09 '22 08:09 alejoe91

Thanks for the write up, @alejoe91. I don't see anything immediately surprising in https://github.com/SpikeInterface/spikeinterface/blob/master/spikeinterface/core/core_tools.py#L635-L709 which definitely makes me worry. Could you help us understand what zarr-level calls are being made in a typical run?

joshmoore avatar Sep 09 '22 08:09 joshmoore

@joshmoore sure, here is how the zarr calls are distributed:

  1. The save() function is routed to this _save() function which is specific to Zarr. Here the zarr file is created and several groups and small datasets are added to it.
  2. The write_traces_to_zarr function does the actual chunk writing (a rough sketch of the zarr-level calls involved is below).
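
In terms of zarr calls, it boils down to something like this (a minimal sketch with made-up shapes and names, not the actual SpikeInterface code): the parent process creates the group, metadata, and dataset, and each worker assigns its buffer into a slice of the array.

import numpy as np
import zarr

# parent process: create the zarr file, small metadata, and the traces dataset
root = zarr.open_group("test_ram.zarr", mode="w")
root.attrs["sampling_frequency"] = 30000.0
traces_ds = root.create_dataset("traces_seg0", shape=(18_000_000, 64),
                                chunks=(30_000, 64), dtype="float32")

# worker: load a buffer and assign it into the corresponding array region
def write_chunk(start_frame, end_frame):
    traces = np.random.randn(end_frame - start_frame, 64).astype("float32")  # stand-in buffer
    traces_ds[start_frame:end_frame, :] = traces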

I hope this makes it clearer!

alejoe91 avatar Sep 09 '22 09:09 alejoe91

Thanks for the explanation, @alejoe91. Unfortunately, I don't think I'm going to be able to get to a reproducible example from your description. Would it be possible to extract the Zarr code, or to dump your process's memory after a few iterations, so we can pinpoint what's leaking?

joshmoore avatar Sep 12 '22 15:09 joshmoore

@joshmoore I'll try to print out RAM usage with and without forcing GC. I'm a bit busy these next couple of days, so I'm planning to do it early next week.
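
Something along these lines (a sketch, not the exact logging code I'll end up using):

import gc
import os
import psutil

process = psutil.Process(os.getpid())

def log_ram(start_frame, end_frame, force_gc=False):
    # log resident memory for the current worker, optionally after a forced collection
    if force_gc:
        gc.collect()
    rss_mb = process.memory_info().rss / 1e6
    print(f"chunk {start_frame}-{end_frame}: RSS = {rss_mb:.1f} MB")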

alejoe91 avatar Sep 15 '22 04:09 alejoe91

Hi @joshmoore

Sorry for the delay in getting back to you.

I tested on my local machine and it seems that RAM usage is under control and it doesn't grow as reported here.

I initially encountered the issue using a cloud resource on GCP, so that might be the culprit. I'll repeat the test there to see whether that architecture triggers the abnormal RAM consumption.

Here is a log which prints the start_frame, end_frame, and RAM usage at each iteration (using 4 jobs, 1s chunk size). I'll provide the same log when running on GCP in a few days.

zarr_garbage_log.txt

alejoe91 avatar Sep 22 '22 16:09 alejoe91

@alejoe91 when you encountered the initial issue, were you storing the Zarr on GCS?

If so, I think I'm running into the same thing... In my case, data I've written to a Zarr group that uses gcsfs isn't garbage collected.
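
For clarity, the pattern I mean is roughly this (bucket and path are placeholders):

import gcsfs
import zarr

fs = gcsfs.GCSFileSystem()
store = fs.get_mapper("my-bucket/data.zarr")   # placeholder bucket/path
root = zarr.open_group(store, mode="a")
# arrays written under `root` seem to keep their buffers referenced after the write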

johnurbanik avatar Dec 01 '22 20:12 johnurbanik

cc @martindurant

jakirkham avatar Dec 01 '22 21:12 jakirkham

Sounds like someone needs to run pympler? I don't have any immediate ideas why gcsfs should be holding on to references.
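
For reference, a minimal pympler pass inside a worker after a few chunk writes could look like this (a sketch; the filtering would need tuning to be useful):

from pympler import muppy, summary

all_objects = muppy.get_objects()                 # snapshot of live objects
summary.print_(summary.summarize(all_objects))    # per-type counts and total sizes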

martindurant avatar Dec 02 '22 15:12 martindurant