cucim
[BUG] No performance gain with regionprops_table
Hi! First, thank you for developing a great tool. It is proving very useful for a number of my projects.
I am encountering an issue when running regionprops_table: I do not see any performance gain when executing the cucim GPU regionprops_table vs. the normal CPU scikit-image function. In fact, the cucim implementation appears to run slower.
The label array is 32-bit, but I also tested 16-bit label images and see the same behavior.
I can confirm cucim itself is running as expected, since executing the vesselness example notebook shows the expected performance gain: https://github.com/rapidsai/cucim/blob/branch-22.04/notebooks/vesselness_example.ipynb
I am using:
cucim 22.2.0
Thanks for bringing this up @rjesud. regionprops is one area that is currently known to not be accelerated well in cuCIM. What are the typical "areas" of the regions in your application? We will have to think of a better way to refactor the regionprops calculations.
There is currently GPU kernel launch overhead that occurs on a per-region and per-property basis, so unless each region is quite large, the CPU will be faster. There are also some similar region-based measurement functions in cupyx.scipy.ndimage. You could try those as an alternative, but I think they have basically the same limitation.
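For concreteness, here is a minimal sketch of what calling those region-based measurement functions might look like; the random image and labels below are toy stand-ins, not data from this thread:

```python
import cupy as cp
import cupyx.scipy.ndimage as ndi

# toy stand-ins for a real intensity image and label image (assumptions)
image = cp.random.random((1024, 1024))
labels, n = ndi.label(cp.random.random((1024, 1024)) > 0.7)
index = cp.arange(1, n + 1)   # measure every nonzero label in one call

maxima = ndi.maximum(image, labels=labels, index=index)
stdevs = ndi.standard_deviation(image, labels=labels, index=index)
```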
Most likely what is needed to see more benefit is to group computation of multiple properties into a single GPU kernel call, but that would involve some fairly large refactoring of the regionprops code. I think this is a fairly popular functionality of the scikit-image API, though, so it is worth thinking about strategies to accelerate it.
Hi @grlee77, Thank you for the response and explanation. This certainly explains the performance issue with my use case. I am working with many small regions; in some images these regions can number up to 500k. They are typically round, touching objects with diameters of 10-30 pixels. This would be the typical scenario for whole-slide digital pathology single-cell segmentation analysis. There are more and more single-cell segmentation tools, like StarDist and Cellpose, which offer GPU-accelerated whole-slide segmentation, but I have yet to find a performant single-cell feature extraction function to match.
@gigony, should we add a performance label? I wouldn't classify this as a bug in that the output is correct; it is just slower than desired.
@grlee77 agree. I don't think it is a bug. Let's add a performance tag for this.
Thanks, All!
In the specific case of "area", since this is just the number of pixels, I think you can use cupy.bincount to more efficiently get all of the areas in one call.
Here is a function that should work specifically for areas and centroids. It assumes the background has label value 0 and all other integer values are labeled regions. It is an n-dimensional extension of this stackoverflow answer, which pointed out the key idea: there is a weights option to bincount.
The drawback of this approach is that it will have some memory overhead due to creation of the coordinate grid of size (labels.size * labels.ndim). It will likely be much faster than the current regionprops implementation here, though. It is around 6x faster than the CPU on my system (Intel Skylake-X CPU, GTX 1080 Ti GPU) for a 2D labels image of shape (2048, 2048) with ~300 distinct labels.
```python
import cupy as cp


def areas_and_centroids(labels):
    grid = cp.meshgrid(*tuple(cp.arange(s) for s in labels.shape),
                       indexing='ij')
    labels = labels.ravel()
    # index starting from 1 in the output of bincount to exclude the background
    areas = cp.bincount(labels)[1:]
    centroids = cp.stack(
        [cp.bincount(labels, weights=g.ravel())[1:] for g in grid],
        axis=-1
    )
    centroids /= areas[:, cp.newaxis]
    return areas, centroids
```
areas will be a 1D array with entries for labels 1 through n_label; centroids will be a 2D array of shape (n_label, labels.ndim).
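A hypothetical usage example of the function above (the toy labels here come from cupyx.scipy.ndimage.label, but any integer label image would do):

```python
import cupy as cp
import cupyx.scipy.ndimage as ndi

# toy label image (assumption): label a random binary mask
labels, n_labels = ndi.label(cp.random.random((2048, 2048)) > 0.7)

areas, centroids = areas_and_centroids(labels)
print(areas.shape)      # (n_labels,)
print(centroids.shape)  # (n_labels, 2) for a 2D labels image
```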
@grlee77 Thank you very much! This moves us much closer to the goal. The other feature I am interested in is mean_intensity. Do you have any suggestions for a workaround?
> The other feature I am interested in is mean_intensity. Do you have any suggestions for a workaround?
Can you try cupyx.scipy.ndimage.mean and see if it is sufficient? (Or, if it is faster, use cupyx.scipy.ndimage.sum and then divide by the already computed areas from above to get the mean.)
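A sketch of both options, again on toy data (the image and labels are assumptions, and the second option reuses the bincount-based areas from the earlier snippet):

```python
import cupy as cp
import cupyx.scipy.ndimage as ndi

image = cp.random.random((2048, 2048))                       # toy intensity image
labels, n = ndi.label(cp.random.random((2048, 2048)) > 0.7)  # toy label image
index = cp.arange(1, n + 1)

# option 1: per-region mean intensity directly
mean_intensity = ndi.mean(image, labels=labels, index=index)

# option 2: per-region sums divided by the areas already computed via bincount
areas = cp.bincount(labels.ravel())[1:]
mean_intensity = ndi.sum(image, labels=labels, index=index) / areas
```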
Thanks for the suggestion! Some preliminary runs show that there is some performance gain with cupyx.scipy.ndimage.mean; however, it appears sporadic. In other runs, I see it take anywhere from 4 to 20 seconds. I am not sure what causes this difference. Is there some sort of cache I should be clearing?
@grlee77 any idea on @rjesud's comment above?
> In other runs, I see it take anywhere from 4 to 20 seconds. I am not sure what causes this difference. Is there some sort of cache I should be clearing?
Do you mean for cupyx.scipy.ndimage.mean? There should be overhead the very first time it is used, as the kernels get compiled at runtime. After that, the compiled kernels get loaded from memory (or from disk if the session has been restarted).
To improve the accuracy, you should technically get the device prior to starting timing via dev = cupy.cuda.Device() and then insert dev.synchronize() immediately after the function you are benchmarking. This will make sure kernel execution has fully completed before time() is called. (CUDA kernels are launched asynchronously and may return control to the host before they actually complete.)
Alternatively, CuPy has a utility called repeat that can be used for this purpose and takes care of the synchronization for you. For relatively long kernel runs as seen here, you will want to change the large default number of repetitions to something much smaller, like n_warmup=1, n_repeat=5. This would call the kernel once as a dry run and then average the duration of the next 5 runs.