
No performance gain with regionprops_table [BUG]

Open rjesud opened this issue 2 years ago • 12 comments

Hi, Firstly, thank you for developing a great tool! It is proving to be very useful for a number of my projects.

I am encountering an issue when running regionprops_table. I do not see any performance gain when executing the cucim GPU regionprops_table vs. the normal CPU scikit image function. In fact, it appears that the cucim implementation runs slower.

[screenshot: timing comparison of cucim regionprops_table vs. scikit-image]

The label array is 32bit but I also tested this on 16bit label images and I see the same behavior.

I can confirm cucim is running as expected since I can execute the vesselness example notebook and I can see the performance gain: https://github.com/rapidsai/cucim/blob/branch-22.04/notebooks/vesselness_example.ipynb

I am using cucim 22.2.0 and cupy-cuda112 10.2.0.

rjesud avatar Mar 23 '22 21:03 rjesud

Thanks for bringing this up @rjesud. regionprops is one area that is currently known to not be accelerated well in cuCIM. What are the typical "areas" of the regions in your application? We will have to think of a better way to refactor the regionprops calculations.

There is currently GPU kernel launch overhead that occurs on a per-region and per-property basis, so unless each region is quite large, the CPU will be faster. There are also some similar region-based measurement functions in cupyx.scipy.ndimage. You could try those as an alternative, but I think they have basically the same limitation.

Most likely what is needed to see more benefit is to group computation of multiple properties into a single GPU kernel call, but that would involve some fairly large refactoring of the regionprops code. I think this is a fairly popular functionality of the scikit-image API, though, so it is worth thinking about strategies to accelerate it.

grlee77 avatar Mar 23 '22 22:03 grlee77

Hi @grlee77, thank you for the response and explanation. This certainly explains the performance issue with my use case. I am working with many small regions; in some images these regions can number up to 500k. They are typically round, touching objects with diameters of 10-30 pixels. This is the typical scenario for whole-slide digital pathology single-cell segmentation analysis. There are more and more single-cell segmentation tools like Stardist and Cellpose that offer GPU-accelerated whole-slide segmentation, but I have yet to find a performant single-cell feature extraction function to match.

rjesud avatar Mar 24 '22 02:03 rjesud

@gigony, should we add a performance label? I wouldn't classify this as a bug in that the output is correct; it is just slower than desired.

grlee77 avatar Mar 24 '22 02:03 grlee77

@grlee77 agree. I don't think it is a bug. Let's add a performance tag for this.

gigony avatar Mar 24 '22 03:03 gigony

Thanks, All!

rjesud avatar Mar 24 '22 15:03 rjesud

In the specific case of "area", since this is just the number of pixels, I think you can use cupy.bincount to more efficiently get all of the areas in one call.

grlee77 avatar Mar 24 '22 15:03 grlee77

Here is a function that should work specifically for areas and centroids. It assumes the background has label value 0 and all other integer values are labeled regions. It is an n-dimensional extension of this Stack Overflow answer, which pointed out the key idea: bincount accepts a weights option.

The drawback of this approach is that it will have some memory overhead due to creation of the coordinate grid of size (labels.size * labels.ndim). It will likely be much faster than the current regionprops implementation here, though. It is around 6x faster than the CPU on my system (Intel Skylake-X CPU, GTX 1080 Ti GPU) for a 2D labels image of shape (2048, 2048) with ~300 distinct labels.

import cupy as cp

def areas_and_centroids(labels):
    # coordinate grid: one index array per axis, each with the same shape as labels
    grid = cp.meshgrid(*tuple(cp.arange(s) for s in labels.shape),
                       indexing='ij')
    labels = labels.ravel()

    # index starting from 1 in the output of bincount to exclude the background
    areas = cp.bincount(labels)[1:]
    # per-label coordinate sums, computed via the weights option of bincount
    centroids = cp.stack(
        [cp.bincount(labels, weights=g.ravel())[1:] for g in grid],
        axis=-1
    )
    # divide the coordinate sums by the areas to get mean coordinates (centroids)
    centroids /= areas[:, cp.newaxis]
    return areas, centroids

areas will be a 1D array with entries for labels 1 through n_label; centroids will be a 2D array of shape (n_label, labels.ndim).
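For example, a quick usage sketch (the toy labels array below is just an assumption for illustration):

labels = cp.zeros((2048, 2048), dtype=cp.int32)
labels[10:40, 10:40] = 1       # label 1: a 30 x 30 square
labels[100:150, 200:260] = 2   # label 2: a 50 x 60 rectangle
areas, centroids = areas_and_centroids(labels)
# areas -> [ 900 3000]
# centroids[i] is the (row, col) centroid of label i + 1, e.g. centroids[0] == [24.5, 24.5]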

grlee77 avatar Mar 24 '22 17:03 grlee77

@grlee77 Thank you very much! This moves us much closer to the goal. The other feature I am interested in is mean_intensity. Do you have any suggestion for a workaround?

rjesud avatar Mar 25 '22 16:03 rjesud

The other feature I am interested in is mean_intensity. Do you have any suggestion for a workaround?

Can you try cupyx.scipy.ndimage.mean and see if it is sufficient? (or if it is faster, cupyx.scipy.ndimage.sum and then divide by the already computed areas from above to get the mean)
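A minimal sketch of what that could look like (the toy labels/intensity arrays and their sizes are assumptions; the second variant reuses the bincount-based areas idea from above):

import cupy as cp
from cupyx.scipy import ndimage as ndi

# toy stand-ins for the real data: integer labels (0 = background) and an intensity image
labels = cp.zeros((2048, 2048), dtype=cp.int32)
labels[100:130, 100:130] = 1
labels[500:540, 500:550] = 2
intensity = cp.random.random(labels.shape).astype(cp.float32)
index = cp.arange(1, int(labels.max()) + 1)

# per-region mean intensity, one value per label 1..n_label
mean_intensity = ndi.mean(intensity, labels=labels, index=index)

# alternative: per-region sums divided by the bincount-based areas
sums = ndi.sum(intensity, labels=labels, index=index)
areas = cp.bincount(labels.ravel())[1:]
mean_intensity_alt = sums / areas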

grlee77 avatar Mar 25 '22 18:03 grlee77

Thanks for the suggestion! Some preliminary runs show that there is some performance gain with cupyx.scipy.ndimage.mean. However, the speedup appears sporadic.

[screenshot: timing of a cupyx.scipy.ndimage.mean run]

In other runs, I see it take anywhere from 4 to 20 seconds. I am not sure what causes this difference. Is there some sort of cache I should be clearing?

Another run: [screenshot of the timing results]


rjesud avatar Mar 29 '22 01:03 rjesud

@grlee77 any idea on @rjesud's comment above?

gigony avatar Apr 05 '22 18:04 gigony

In other runs, I see it take anywhere from 4 to 20 seconds. I am not sure what causes this difference. Is there some sort of cache I should be clearing?

Do you mean for cupyx.scipy.ndimage.mean? There should be overhead the very first time it is used as the kernels get compiled at runtime. After that, the compiled kernels get loaded from memory (or from disk if the session has been restarted).

To improve the accuracy, you should get the device prior to starting the timing via

dev = cupy.cuda.Device()

and then insert

dev.synchronize()

immediately after the function you are benchmarking. This will make sure kernel execution has fully completed before time() is called. (CUDA kernels are launched asynchronously and may return control to the host before they actually complete).
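A minimal timing sketch along those lines (the toy labels/intensity arrays and the cupyx.scipy.ndimage.mean call are just placeholders for the real benchmark):

import time

import cupy as cp
from cupyx.scipy import ndimage as ndi

# toy stand-ins: integer labels (0 = background) and an intensity image
labels = cp.random.randint(0, 300, size=(2048, 2048)).astype(cp.int32)
intensity = cp.random.random(labels.shape).astype(cp.float32)
index = cp.arange(1, int(labels.max()) + 1)

dev = cp.cuda.Device()

# warm-up call so runtime kernel compilation is not included in the measurement
ndi.mean(intensity, labels=labels, index=index)
dev.synchronize()

tstart = time.time()
means = ndi.mean(intensity, labels=labels, index=index)
dev.synchronize()  # wait for the asynchronous kernels to finish before reading the clock
print(f"per-region mean: {time.time() - tstart:.4f} s")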

Alternatively, CuPy has a utility called repeat that can be used for this purpose and takes care of the synchronization for you. For relatively long kernel runs as seen here, you will want to change the large default number of repetitions to something much smaller like n_warmup=1, n_repeat=5. This would call the kernel once as a dry run and then average the duration of the next 5 runs.
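Continuing the snippet above, a sketch of that approach (in recent CuPy this utility lives at cupyx.profiler.benchmark; the labels/intensity/index arrays are reused from the timing sketch earlier in the thread):

from cupyx.profiler import benchmark  # older CuPy versions expose this as cupyx.time.repeat

# one dry run, then average the durations of the next 5 timed runs
perf = benchmark(
    ndi.mean,
    (intensity,),
    {'labels': labels, 'index': index},
    n_warmup=1,
    n_repeat=5,
)
print(perf)  # prints the mean and std of both the CPU (host) and GPU times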

grlee77 avatar Apr 05 '22 18:04 grlee77