
Performance issue with `sum`

Open · keflavich opened this issue 4 years ago · 4 comments

Working on a modest-size cube, I found that scipy.ndimage.sum is ~100-300x faster than dask_image.ndmeasure.sum_labels

import numpy as np, scipy.ndimage
blah = np.random.randn(19,512,512)
msk = blah > 3
lab, ct = scipy.ndimage.label(msk)
%timeit scipy.ndimage.sum(msk, labels=lab, index=range(1, ct+1))
# 117 ms ± 2.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

vs

from dask_image import ndmeasure
rslt = ndmeasure.sum_labels(msk, label_image=lab, index=range(1, ct+1))
rslt
# dask.array<getitem, shape=(6667,), dtype=float64, chunksize=(1,), chunktype=numpy.ndarray>
rslt.compute()
# [########################################] | 100% Completed | 22.9s

Note also that building the task graph alone takes nontrivial time:

%timeit ndmeasure.sum_labels(msk, label_image=lab, index=range(1, ct+1))
# 15.4 s ± 2.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

While I understand that there ought to be some cost to running this processing through a graph with dask, this seems excessively slow. Is there a different approach I should be taking, or is this a bug?

keflavich avatar Jun 01 '21 22:06 keflavich

For modest-size cubes, the recommended approach is to use numpy/scipy directly, instead of Dask. This is the first point on the Dask best practices page.
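
If you do want the result to stay lazy as part of a larger Dask pipeline, one low-overhead option (just a sketch, not something dask-image provides) is to wrap the whole scipy reduction in a single delayed task, so the graph has one node rather than one or more per label:

import dask
import numpy as np, scipy.ndimage

blah = np.random.randn(19, 512, 512)
msk = blah > 3
lab, ct = scipy.ndimage.label(msk)

# Run scipy.ndimage.sum once, as a single task in the graph
summed = dask.delayed(scipy.ndimage.sum)(msk, labels=lab, index=range(1, ct + 1))
result = summed.compute()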

Because all the individual tasks are very fast to run, the overhead involved with Dask becomes overwhelming. First, the task graph itself is very time-consuming to build. Then, the scheduler can't distribute the tasks onto the workers efficiently, so you'll see lots of blank white space between teeny tiny tasks on the task stream graph in the dask dashboard.
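
A quick (unscientific) way to see how big the graph is before computing anything, reusing the names from the snippet above:

rslt = ndmeasure.sum_labels(msk, label_image=lab, index=range(1, ct + 1))
print(len(rslt.__dask_graph__()))  # total number of tasks the scheduler has to handle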

I've done some very haphazard profiling of the example you give above, and it seems like some of the slowness might be due to slicing. There is work currently being done on high level graphs in dask that might help with this, but I couldn't say how much it would impact this specific case.
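
For anyone who wants to reproduce that kind of rough profiling in IPython, something along these lines works (this only profiles graph construction; nothing is computed):

%prun -l 20 ndmeasure.sum_labels(msk, label_image=lab, index=range(1, ct + 1))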

Also, I notice that labeled_comprehension has a dask.array.stack call in it, which isn't a very efficient operation. So that might also be worth looking into (or I might just be unreasonably suspicious of this one thing).
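
As a side note, for this specific case (summing a mask over labels, with everything already in memory) a plain-numpy workaround is a single bincount call. This is just a rough sketch, not a dask-image API:

import numpy as np

# Weight each voxel by its mask value and accumulate per label id;
# label 0 is the background, so drop it at the end.
per_label_sums = np.bincount(
    lab.ravel(),
    weights=msk.ravel().astype(np.float64),
    minlength=ct + 1,
)[1:]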

GenevieveBuckley avatar Jun 02 '21 07:06 GenevieveBuckley

Thanks for checking. Of course it makes sense that dask would be somewhat slower than scipy for a cube that fits in memory; I was just shocked that the difference was this big.

The graph building seems to be pretty heavy on its own. I ran a much smaller example and got the graph below. It doesn't look unreasonable, but I guess it's costly to assemble, and for the ~6000-label example above it was very unwieldy.

[task graph visualization attached]

keflavich avatar Jun 02 '21 11:06 keflavich

> Thanks for checking. Of course it makes sense that dask would be somewhat slower than scipy for a cube that fits in memory; I was just shocked that the difference was this big.

That's pretty reasonable - it is a very big difference!

There is a lot of work happening on high level graphs in Dask to try to reduce how costly it is to build the task graphs (or at least, to shift some of that cost onto the workers instead of the client). I'll bring this use case up as an example of something we could try to tackle after slicing and array overlaps.

GenevieveBuckley avatar Jun 04 '21 09:06 GenevieveBuckley