Parallelizing tiling with empty pipeline
Hi,
I wanted to create a new issue to hear your opinions on my reply to #218, which was closed.
I was able to parallelize running an empty pipeline, which I understand should not benefit from dask distributed (https://github.com/Dana-Farber-AIOS/pathml/issues/218#issuecomment-956591111). Here's what I did; I wonder if you could re-frame this within the dask framework to achieve parallel processing, even with an empty pipeline?
```python
from p_tqdm import p_umap  # parallel map built on pathos
from pathml.core import SlideData, types
from pathml.preprocessing import Pipeline

def run_and_write(slide_name):
    slide = SlideData(slide_name, backend="bioformats", slide_type=types.Vectra)
    pipeline = Pipeline([])  # empty pipeline
    slide.run(pipeline, distributed=False)
    slide.write(f'/path_to_slide/{slide.name}.h5')

p_umap(run_and_write, slides_names)
```
Here, `p_umap` is a multiprocessing function that distributes the function `run_and_write` and the members of the `slides_names` list to different workers. So my speedup was proportional to the number of workers I had.
I'm curious to hear your thoughts on this.
Originally posted by @surya-narayanan in https://github.com/Dana-Farber-AIOS/pathml/issues/218#issuecomment-987297664
Thanks! So this slide-level distribution is different from how we've set things up, which is based around tile-level distribution. Any suggestions for how we could integrate this into the PathML API?
It looks like `p_umap` is built on pathos, which I don't have experience with. Is there a reason you used that and not dask?
This could be out of my depth, but just to start a discussion: we could replicate this behavior, `big_future = client.scatter(tile)`, at the slide level, in `slide_dataset.run`
https://github.com/Dana-Farber-AIOS/pathml/blob/b2eca9ed02e990ace16f3cb7f23b16828e12cc19/pathml/core/slide_dataset.py#L56
where instead of looping over slides, we could scatter them to a dask distributed client.
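
For example, here is a minimal sketch of what that could look like (illustrative only: it assumes `dataset` is a `SlideDataset` and `pipeline` is already defined, and that `SlideData` objects and their file handles can actually be pickled and shipped to workers):

```python
from dask.distributed import Client

def process_slide(slide, pipeline):
    # Run the pipeline on one whole slide, with tile-level distribution disabled
    slide.run(pipeline, distributed=False)
    return slide

client = Client()
futures = []
for slide in dataset.slides:
    big_future = client.scatter(slide)  # ship the whole slide to a worker
    futures.append(client.submit(process_slide, big_future, pipeline))
processed_slides = client.gather(futures)
```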
As to why `p_umap` and not dask, I don't have a good answer: just a package I was more used to (and, in my opinion, very easy to use out of the box).
Yep, I agree, it seems like it would make the most sense as a new method for `SlideDataset`. Instead of submitting `pipeline.apply(tile)` jobs to the cluster, you would instead be submitting `SlideData.run(pipeline)` jobs. I think you'd need to make sure that `distributed=False` is set within those `run()` calls, otherwise each one will try to spin up its own cluster.
I think it would be best to use dask if possible, so that we minimize the number of external dependencies that we need to support for installation and troubleshooting. The new method could also use the same API, where you pass a pipeline, a dask `Client`, and other parameters.
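
A rough sketch of what that method's API could look like, written here as a standalone function (the name `run_distributed` and the `write_dir` parameter are hypothetical, not existing PathML API, and the same caveat about pickling `SlideData` objects applies):

```python
from dask.distributed import Client

def run_distributed(dataset, pipeline, client=None, write_dir=None, **kwargs):
    """Run `pipeline` over each slide in `dataset` as a separate dask job.

    Forces distributed=False in each per-slide run() call so that individual
    slides don't try to spin up their own clusters.
    """
    client = client or Client()
    kwargs["distributed"] = False

    def run_one(slide):
        slide.run(pipeline, **kwargs)
        if write_dir is not None:
            slide.write(f"{write_dir}/{slide.name}.h5")
        return slide

    futures = client.map(run_one, dataset.slides)
    dataset.slides = client.gather(futures)
    return dataset
```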
Does this fit with what you had in mind? Thanks!
Yes, that makes sense. I'm happy to take this up.