Parallelizing tiling with empty pipeline
Hi,
I wanted to create a new issue to hear your opinions on my reply to #218, which was closed.
I was able to parallelize running an empty pipeline, which I understand should not benefit from dask distributed (https://github.com/Dana-Farber-AIOS/pathml/issues/218#issuecomment-956591111). Here's what I did; I wonder if you could re-frame this within the dask framework to achieve parallel processing, even with an empty pipeline?
```python
from p_tqdm import p_umap  # parallel map built on pathos
from pathml.core import SlideData, types
from pathml.preprocessing import Pipeline

def run_and_write(slide_name):
    slide = SlideData(slide_name, backend="bioformats", slide_type=types.Vectra)
    pipeline = Pipeline([])  # empty pipeline
    slide.run(pipeline, distributed=False)
    slide.write(f'/path_to_slide/{slide.name}.h5')

p_umap(run_and_write, slides_names)
```
Here, `p_umap` is a multiprocessing function that distributes the function `run_and_write` and the members of the `slides_names` list to different workers. So my speedup was proportional to the number of workers I had.
I'm curious to hear your thoughts on this.
Originally posted by @surya-narayanan in https://github.com/Dana-Farber-AIOS/pathml/issues/218#issuecomment-987297664
Thanks! So this slide-level distribution is different from how we've set things up, which is based around tile-level distribution. Any suggestions for how we could integrate this into the PathML API?
It looks like `p_umap` is built on pathos, which I don't have experience with. Is there a reason you used that and not dask?
This could be out of my depth, but just to start a discussion: we could replicate this behavior, `big_future = client.scatter(tile)`, at the slide level, in `slide_dataset.run`
https://github.com/Dana-Farber-AIOS/pathml/blob/b2eca9ed02e990ace16f3cb7f23b16828e12cc19/pathml/core/slide_dataset.py#L56
where instead of looping over slides, we could scatter them to a dask distributed client.
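
For example, here is a minimal sketch of what that could look like (illustrative only: it assumes `dataset` is a `SlideDataset` and `pipeline` is already defined, and that `SlideData` objects and their file handles can actually be pickled and shipped to workers):

```python
from dask.distributed import Client

def process_slide(slide, pipeline):
    # Run the pipeline on one whole slide, with tile-level distribution disabled
    slide.run(pipeline, distributed=False)
    return slide

client = Client()
futures = []
for slide in dataset.slides:
    big_future = client.scatter(slide)  # ship the whole slide to a worker
    futures.append(client.submit(process_slide, big_future, pipeline))
processed_slides = client.gather(futures)
```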
As to why `p_umap` and not dask, I don't have a good answer: just a package I was more used to (and, in my opinion, very easy to use out of the box).
Yep, I agree, it seems like it would make the most sense as a new method for `SlideDataset`. Instead of submitting `pipeline.apply(tile)` jobs to the cluster, you would instead be submitting `SlideData.run(pipeline)` jobs. I think you'd need to make sure that `distributed=False` is set within those `run()` calls, otherwise each one will try to spin up its own cluster.
I think it would be best to use dask if possible, so that we minimize the number of external dependencies that we need to support for installation and troubleshooting. The new method could also use the same API, where you pass a pipeline, a dask `Client`, and other parameters.
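
A rough sketch of what that method's API could look like, written here as a standalone function (the name `run_distributed` and the `write_dir` parameter are hypothetical, not existing PathML API, and the same caveat about pickling `SlideData` objects applies):

```python
from dask.distributed import Client

def run_distributed(dataset, pipeline, client=None, write_dir=None, **kwargs):
    """Run `pipeline` over each slide in `dataset` as a separate dask job.

    Forces distributed=False in each per-slide run() call so that individual
    slides don't try to spin up their own clusters.
    """
    client = client or Client()
    kwargs["distributed"] = False

    def run_one(slide):
        slide.run(pipeline, **kwargs)
        if write_dir is not None:
            slide.write(f"{write_dir}/{slide.name}.h5")
        return slide

    futures = client.map(run_one, dataset.slides)
    dataset.slides = client.gather(futures)
    return dataset
```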
Does this fit with what you had in mind? Thanks!
Yes, that makes sense. I'm happy to take this up.