Leaking memory
Describe the bug
We believe this issue is similar to https://github.com/Dana-Farber-AIOS/pathml/issues/135#issue-926470357
There seem to be some unmanaged memory issues. Our guess is that this is causing problems when running the cell segmentation pipeline on the Vectra slides. Here's what we did:
- We created a slide dataset with ~220 Vectra slides
- Created a pipeline with mesmer
- Ran it without specifying the tile size
After about 10 mins, we got the following warnings consistently - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 9.33 GiB -- Worker memory limit: 13.27 GiB
Since we're on a 54 GB RAM machine, the unmanaged memory / memory leak limits the number of processes being run to 4; in contrast, at the beginning there were ~16.
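For reference, here is a sketch of how the worker count and per-worker memory limit could be pinned explicitly instead of relying on the defaults (the numbers below are hypothetical, just to illustrate the knobs):

from dask.distributed import Client, LocalCluster

# Hypothetical sizing for a ~54 GB machine: fewer workers, each with an explicit memory cap
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="12GiB")
client = Client(cluster)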
To Reproduce Here is our pipeline. We cannot post data here owing to logistical constraints and regulations.
pipeline = Pipeline([
    CollapseRunsVectra(),
    SegmentMIF(model='mesmer', nuclear_channel=0, cytoplasm_channel=2, image_resolution=0.5,
               gpu=False, postprocess_kwargs_whole_cell=None,
               postprocess_kwargs_nuclear=None),
    QuantifyMIF('nuclear_segmentation')
])
Expected behavior The pipeline runs to completion across all workers, without unmanaged memory growing to the point that parallelism drops.
Additional context We believe this is causing issues further down the line, such as our running into the same error reported by other users in https://github.com/Dana-Farber-AIOS/pathml/issues/164#issue-965029522
This issue still persists:
WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: a.bc GiB -- Worker memory limit: de.f GiB
Here is the full code:
import sys
from dask.distributed import Client, LocalCluster

sys.path.append('/home/jupyter/Surya/if-analysis')
# `imports` is a local module; it presumably provides Path, tqdm, and the pathml classes used below
from imports import *

if __name__ == '__main__':
    # Read in the file paths
    src = '/Lung_Cancer/mIF/'
    slides_names = [path.as_posix() for path in Path(src).rglob('*.qptiff') if 'HnE' not in path.name]

    # Create slide objects from the file paths, using the Vectra slide type
    slides = []
    for i in tqdm(slides_names):
        # Create slide object
        slide = SlideData(i, backend="bioformats", slide_type=types.Vectra)
        slides.append(slide)

    # Create a slide dataset (first 5 slides only)
    slide_dataset = SlideDataset(slides[:5])

    # Create a pipeline
    pipeline = Pipeline([
        CollapseRunsVectra(),
        SegmentMIF(model='mesmer', nuclear_channel=0, cytoplasm_channel=2, image_resolution=0.5,
                   gpu=True, postprocess_kwargs_whole_cell=None,
                   postprocess_kwargs_nuclear=None),
        QuantifyMIF('nuclear_segmentation')
    ])

    cluster = LocalCluster(n_workers=20)
    client = Client(cluster)

    # Specify where to save the files
    write_dir = '/mnt/disks/ip_tiles/ip_slides_pathml/'

    # Run and write
    slide_dataset.run(pipeline=pipeline, client=client, write_dir=write_dir, distributed=True)
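For reference, per-worker resident memory can also be polled from the client like this (a sketch; this assumes psutil is importable on the workers, which it should be since distributed depends on it):

import psutil

def worker_rss_gib():
    # Resident set size of the current worker process, in GiB
    return psutil.Process().memory_info().rss / 2**30

# Returns a dict mapping each worker address to its RSS in GiB
print(client.run(worker_rss_gib))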
Just to add, here is a screenshot of htop. It seems like only one thread is active.
Thanks Surya. I have no idea what is causing this but happy to help look into it! First, we'll need to figure out whether this is caused by a bug in the pathml code itself, and/or by the way you have your Dask cluster configured. Can you start by taking a look at the diagnostics dashboard for your cluster to see if that has any information that will point us in the right direction?
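If you're not sure where to find it, client.dashboard_link should print the URL (by default the dashboard is served on port 8787):

# Print the URL of the Dask diagnostics dashboard for the running client
print(client.dashboard_link)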
One theory we have been floating is that because the Mesmer model is initialized on each worker, and the model itself is pretty big (lots of parameters), Dask might see that memory usage and think it is unmanaged.
While I look at the diagnostics dashboard, I wanted to tell you this: a smaller tile size (I think) didn't kill my program, but it took about 8 hours to run on 5 WSIs. A bigger tile size kills the program, for some reason.
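Roughly, the smaller-tile run looked like this (a sketch; the exact tile_size value here is just illustrative, assuming slide_dataset.run passes tile_size through to each slide's run):

# Hypothetical tile size, just to show where the parameter goes
slide_dataset.run(pipeline=pipeline, client=client, write_dir=write_dir,
                  distributed=True, tile_size=256)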
So I'm looking at it right now, and all of the memory seems to be unmanaged. It seems that something needs to call .free() or trigger garbage collection on this; I will look into it more and post here.
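For example, something along these lines should force a collection pass on every worker (just a sketch):

import gc

# Ask each worker process to run a full garbage-collection pass
client.run(gc.collect)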
This image shows unmanaged memory per worker (blue = managed, yellow = unmanaged).
It appears that this might be causing only one worker to do any work, disrupting the parallelism.
After some googling I found these resources:
- https://coiled.io/blog/tackling-unmanaged-memory-with-dask/
- https://distributed.dask.org/en/latest/worker.html#memory-not-released-back-to-the-os
- https://github.com/dask/dask/issues/3530
From my reading of these, the problem seems to be caused at the operating system level, so the high unmanaged memory doesn't necessarily mean that there's a bug in your code or in pathml code.
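One of the manual workarounds described in the Dask worker docs is to explicitly trim freed memory on each worker; here is a sketch (this assumes the workers run on Linux with glibc):

import ctypes
import gc

def trim_memory():
    # Collect garbage first, then ask glibc to return freed arenas to the OS
    gc.collect()
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)

# Run the trim on every worker in the cluster
client.run(trim_memory)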
Can you try out some of the suggestions in those links, e.g. running with MALLOC_TRIM_THRESHOLD_=0?
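If it helps, the variable has to be set before the worker processes are spawned, e.g. by exporting it in the shell before launching the script, or in Python before creating the cluster (a sketch; this assumes environment variables set in the parent process propagate to the spawned workers):

import os

# Must be set before the LocalCluster / worker processes are created
os.environ["MALLOC_TRIM_THRESHOLD_"] = "0"

from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=20)
client = Client(cluster)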
MALLOC_TRIM_THRESHOLD_=0 did not fix it.