dask-image
                        Performance issue with `sum`
Working on a modest-sized cube, I found that `scipy.ndimage.sum` is ~100-300x faster than `dask_image.ndmeasure.sum_labels`:
import numpy as np
import scipy.ndimage

blah = np.random.randn(19, 512, 512)
msk = blah > 3
lab, ct = scipy.ndimage.label(msk)

%timeit scipy.ndimage.sum(msk, labels=lab, index=range(1, ct+1))
# 117 ms ± 2.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
vs
from dask_image import ndmeasure

rslt = ndmeasure.sum_labels(msk, label_image=lab, index=range(1, ct+1))
rslt
# dask.array<getitem, shape=(6667,), dtype=float64, chunksize=(1,), chunktype=numpy.ndarray>
rslt.compute()
# [########################################] | 100% Completed | 22.9s
Note also that building the task graph on its own takes nontrivial time:
%timeit ndmeasure.sum_labels(msk, label_image=lab, index=range(1, ct+1))
# 15.4 s ± 2.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
While I understand that there ought to be some cost to running this computation through a Dask graph, this seems excessively slow. Is there a different approach I should be taking, or is this a bug?
For modest-size cubes, the recommended approach is to use numpy/scipy directly, instead of Dask. This is the first point on the Dask best practices page.
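If the rest of a pipeline is already built on Dask, one middle ground (just a sketch, not an official dask-image recipe) is to wrap the single in-memory scipy call in one delayed task, reusing msk, lab, and ct from the snippet above, so the graph contains one node instead of thousands:

import dask

# One delayed task wrapping the whole scipy computation; the graph
# has a single node, so scheduler overhead stays negligible.
sums = dask.delayed(scipy.ndimage.sum)(msk, labels=lab, index=range(1, ct+1))
result = sums.compute()

This gives up parallelism within the call, but for data that fits in memory the scipy call is already fast.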
Because all of the individual tasks are very fast to run, the overhead involved with Dask becomes overwhelming. First, the task graph itself is time-consuming to build. Then, the scheduler can't distribute the tasks onto the workers efficiently, so you'll see lots of blank white space between teeny tiny tasks on the task stream graph in the Dask dashboard.
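One quick way to see the graph-construction cost directly (a rough check, assuming the rslt object from above) is to count the tasks in the graph; with a slice-and-sum chain per label, the task count scales with the number of labels:

# __dask_graph__() returns a mapping from task key to task,
# so its length is the total number of tasks to schedule.
print(len(rslt.__dask_graph__()))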
I've done some very haphazard profiling of the example above, and it seems like some of the slowness might be due to slicing. There is work currently being done on high level graphs in Dask that might help with this, but I couldn't say how much it would impact this specific circumstance.
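For anyone who wants to reproduce that kind of rough profiling, a minimal sketch using the standard-library profiler in the same session (sorting by cumulative time makes the expensive calls easy to spot):

import cProfile

# Profile graph construction only; nothing is computed here.
cProfile.run(
    "ndmeasure.sum_labels(msk, label_image=lab, index=range(1, ct+1))",
    sort="cumtime",
)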
Also, I notice that labeled_comprehension has a dask.array.stack call in it, which isn't a very efficient operation. So that might also be worth looking into (or I might just be unreasonably suspicious of this one thing).
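As a point of comparison for this specific case (per-label sums), a single np.bincount pass over the flattened label image is a fast baseline; this is just an alternative sketch, not a dask-image API:

# bincount sums msk's values per label in one pass; index 0 is the
# background label, so it is dropped. Matches scipy.ndimage.sum with
# index=range(1, ct+1).
per_label = np.bincount(lab.ravel(), weights=msk.ravel())[1:]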
Thanks for checking. Of course it makes sense that dask would be somewhat slower than scipy for a cube that fits in memory, I was just shocked that it was this big a difference.
The graph building seems to be pretty heavy on its own. I ran a much smaller example and got the graph below. It doesn't look unreasonable, but I guess it's costly to assemble; for the ~6000-label example above, it was very unwieldy.
[task graph image from the smaller example]
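For reference, a graph like this can be rendered on a toy input with Dask's built-in visualizer (assuming graphviz is installed; the variable names and filename here are made up):

small = np.random.randn(4, 32, 32) > 2
slab, sct = scipy.ndimage.label(small)
# Render the task graph to a file without computing anything.
ndmeasure.sum_labels(small, label_image=slab, index=range(1, sct+1)).visualize("sum_labels_graph.png")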
> Thanks for checking. Of course it makes sense that dask would be somewhat slower than scipy for a cube that fits in memory, I was just shocked that it was this big a difference.
That's pretty reasonable - it is a very big difference!
There is a lot of work happening on high level graphs in Dask to try to reduce how costly it is to build task graphs (or at least to shift some of that cost from the client onto the workers). I'll bring this use case up as an example of something we could try to tackle after slicing and array overlaps.