
Leaking memory

Open surya-narayanan opened this issue 3 years ago • 6 comments

Describe the bug

We believe this issue is similar to https://github.com/Dana-Farber-AIOS/pathml/issues/135#issue-926470357

There seem to be some unmanaged memory issues. Our guess is that this is causing problems when running the cell segmentation pipeline on the Vectra slides. Here is what we did:

  1. We created a SlideDataset with roughly 220 Vectra slides
  2. Created a pipeline with Mesmer
  3. Ran it without specifying a tile size

After about 10 mins, we got the following warnings consistently - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 9.33 GiB -- Worker memory limit: 13.27 GiB

Since we’re on a machine with 54 GiB of RAM, the unmanaged memory / memory leak limits the number of worker processes to 4. In contrast, at the beginning, there were ~16.
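The drop to 4 workers is consistent with the numbers in the warning: with a per-worker memory limit of ~13.27 GiB, a 54 GiB machine can only accommodate about four workers at their full limit. A rough back-of-envelope check (figures taken from the warning above):

```python
# Figures from the Dask warning above:
#   per-worker memory limit ~13.27 GiB, of which ~9.33 GiB is unmanaged.
total_ram_gib = 54.0
per_worker_limit_gib = 13.27

# How many workers fit if each one actually consumes its full limit:
max_workers = int(total_ram_gib // per_worker_limit_gib)
print(max_workers)  # → 4
```

This does not explain *why* each worker holds that much unmanaged memory, but it does explain why the effective parallelism collapses from ~16 workers to 4.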

To Reproduce Here is our pipeline. We cannot post data here owing to logistical constraints and regulations.

pipeline = Pipeline([
    CollapseRunsVectra(),    
    SegmentMIF(model='mesmer', nuclear_channel=0, cytoplasm_channel=2, image_resolution=0.5, 
               gpu=False, postprocess_kwargs_whole_cell=None, 
               postprocess_kwargs_nuclear=None),
    QuantifyMIF('nuclear_segmentation')   
])

Additional context We believe this is causing issues further down the line, such as our own manifestation of the error reported by other users here: https://github.com/Dana-Farber-AIOS/pathml/issues/164#issue-965029522

surya-narayanan avatar Oct 28 '21 16:10 surya-narayanan

This issue still persists:

WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: a.bc GiB -- Worker memory limit: de.f GiB

Here is the full code:

import sys
from dask.distributed import Client, LocalCluster

sys.path.append('/home/jupyter/Surya/if-analysis')
from imports import *

if __name__ == '__main__':
  #Read in the filepaths 
  src = '/Lung_Cancer/mIF/'
  slides_names = [path.as_posix() for path in Path(src).rglob('*.qptiff') if 'HnE' not in path.name]  

  #Create slides from file paths with a WSI Label for pathology
  slides = []
  for i in tqdm(slides_names):  
    #Create slide object
    slide = SlideData(i, backend = "bioformats", slide_type = types.Vectra)
    slides.append(slide)

  #Create a SlideDataset (using only the first 5 slides)
  slide_dataset = SlideDataset(slides[:5])

  #Create a pipeline
  pipeline = Pipeline([
      CollapseRunsVectra(),    
      SegmentMIF(model='mesmer', nuclear_channel=0, cytoplasm_channel=2, image_resolution=0.5, 
                gpu=True, postprocess_kwargs_whole_cell=None, 
                postprocess_kwargs_nuclear=None),
      QuantifyMIF('nuclear_segmentation')   
      ])

  cluster = LocalCluster(n_workers=20)
  client = Client(cluster)

  #Specify where to save the files
  write_dir = '/mnt/disks/ip_tiles/ip_slides_pathml/'

  #Run and Write
  slide_dataset.run(pipeline = pipeline, client = client, write_dir = write_dir, distributed = True)

Just to add, here is a screenshot of htop

Seems like there's only 1 thread active.

image

surya-narayanan avatar Feb 08 '22 00:02 surya-narayanan

Thanks Surya. I have no idea what is causing this but happy to help look into it! First, we'll need to figure out whether this is caused by a bug in the pathml code itself, and/or by the way you have your Dask cluster configured. Can you start by taking a look at the diagnostics dashboard for your cluster to see if that has any information that will point us in the right direction?

One theory that we have been floating is that because the mesmer model is initialized on each worker, and the model itself is pretty big (lots of parameters), Dask might see that memory usage and think that it's unmanaged.

jacob-rosenthal avatar Feb 08 '22 14:02 jacob-rosenthal

While I look at the diagnostics dashboard, I wanted to mention: a smaller tile size (I think) didn't kill my program, but took about 8 hours to run on 5 WSIs. A bigger tile size kills the program, for some reason.
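The tile-size sensitivity fits the memory picture: a multichannel tile's in-memory footprint grows quadratically with the tile edge length, so large tiles can blow past a worker's limit even when small ones run fine. A rough sketch (the channel count and dtype here are illustrative assumptions, not taken from the actual slides):

```python
def tile_bytes(height, width, channels, bytes_per_pixel=4):
    """Approximate in-memory size of one float32 multichannel tile."""
    return height * width * channels * bytes_per_pixel

# Hypothetical 8-channel Vectra-like image stored as float32:
print(tile_bytes(256, 256, 8))    # → 2097152   (~2 MiB per tile)
print(tile_bytes(3000, 3000, 8))  # → 288000000 (~275 MiB per tile)
```

With intermediate copies made during segmentation and quantification, a few hundred MiB per tile multiplied across concurrent workers adds up quickly.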

surya-narayanan avatar Feb 08 '22 21:02 surya-narayanan

So I'm looking at it right now, and all memory seems to be unmanaged. It seems that something needs to call .free() or trigger garbage collection on this; I will look more and post here.
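One low-effort experiment along those lines is to force a garbage-collection pass on every worker. Dask's `Client.run` executes a function on each worker process, so the usual pattern (the `client` object is assumed to be the one created in the script above) looks like this:

```python
import gc

def collect_garbage() -> int:
    """Run a full GC pass; returns the number of unreachable objects found."""
    return gc.collect()

# On a running Dask cluster this would be dispatched to every worker:
#   client.run(collect_garbage)
# Locally it just runs in-process:
freed = collect_garbage()
```

If unmanaged memory stays high after this, the memory is most likely held by the allocator or by native (non-Python) buffers rather than by collectable Python objects.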

image

This image shows unmanaged memory per worker (blue = managed, yellow = unmanaged):

image

It appears that this might be causing only one worker to work, disrupting the parallelism.

image

surya-narayanan avatar Feb 08 '22 21:02 surya-narayanan

After some googling I found these resources:

  • https://coiled.io/blog/tackling-unmanaged-memory-with-dask/
  • https://distributed.dask.org/en/latest/worker.html#memory-not-released-back-to-the-os
  • https://github.com/dask/dask/issues/3530

From my reading of these, the problem seems to be caused at the operating system level, so the high unmanaged memory doesn't necessarily mean that there's a bug in your code or in pathml code.

Can you try out some of the suggestions in those links, e.g. try with MALLOC_TRIM_THRESHOLD_=0 ?
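For a `LocalCluster`, the environment variable has to be set in the parent process before the cluster spawns its workers, since worker processes inherit the parent's environment. A minimal sketch:

```python
import os

# Must run *before* LocalCluster is created, so the spawned
# worker processes inherit the setting from the parent.
os.environ["MALLOC_TRIM_THRESHOLD_"] = "0"

# cluster = LocalCluster(n_workers=20)  # created afterwards, as in the script above
```

Setting it in the shell (`MALLOC_TRIM_THRESHOLD_=0 python script.py`) works the same way, for the same reason.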

jacob-rosenthal avatar Feb 08 '22 22:02 jacob-rosenthal

MALLOC_TRIM_THRESHOLD_=0 did not fix it.
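Another option from the Dask docs on memory not being released to the OS is to call glibc's `malloc_trim` directly on each worker. This only works on Linux with glibc, and `client` here is assumed to be the cluster client from the script above:

```python
import ctypes

def trim_memory() -> int:
    """Ask glibc to return free heap pages to the OS (Linux/glibc only)."""
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)

# Dispatch to every worker on a running cluster:
#   client.run(trim_memory)
```

`malloc_trim` returns 1 if memory was actually released back to the OS and 0 otherwise, so running it periodically via `client.run` is a cheap way to test whether the unmanaged memory is allocator fragmentation rather than a true leak.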

image

surya-narayanan avatar Feb 08 '22 23:02 surya-narayanan