
Add Otsu's method to cv::cuda::threshold

Open troelsy opened this issue 6 months ago • 6 comments

I implemented Otsu's method in CUDA for a separate project and want to add it to cv::cuda::threshold.

I have made an effort to use existing OpenCV functions in my code, but I had some trouble with ThresholdTypes and cv::cuda::calcHist. I couldn't figure out how to include precomp.hpp to get the definition of ThresholdTypes. For cv::cuda::calcHist I tried adding opencv_cudaimgproc as a dependency, but that creates a circular dependency with cudaarithm. I have included a simple implementation of calcHist so the code runs, but I would like input on how to use cv::cuda::calcHist instead.
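
For context, a self-contained 256-bin histogram for CV_8UC1 input is fairly small; below is a rough sketch of the shared-memory-plus-atomics pattern such a stand-in for cv::cuda::calcHist typically uses. This is illustrative only, not the code in this PR, and the kernel name, launcher and launch configuration are made up for the example.

#include <opencv2/core/cuda.hpp>

// Hypothetical stand-in for cv::cuda::calcHist: 256-bin histogram of a CV_8UC1 image.
// Each block accumulates a histogram in shared memory, then flushes it to global
// memory with atomics. d_hist must be zero-initialized (e.g. cudaMemsetAsync) first.
__global__ void simpleHistKernel(const cv::cuda::PtrStepSzb src, unsigned int* hist)
{
    __shared__ unsigned int smem[256];

    const int tid = threadIdx.y * blockDim.x + threadIdx.x;
    if (tid < 256)
        smem[tid] = 0;
    __syncthreads();

    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < src.cols && y < src.rows)
        atomicAdd(&smem[src(y, x)], 1u);
    __syncthreads();

    if (tid < 256)
        atomicAdd(&hist[tid], smem[tid]);
}

void launchSimpleHist(const cv::cuda::GpuMat& src, unsigned int* d_hist, cudaStream_t stream)
{
    const dim3 block(32, 8); // 256 threads per block so every bin has an owner
    const dim3 grid((src.cols + block.x - 1) / block.x, (src.rows + block.y - 1) / block.y);
    simpleHistKernel<<<grid, block, 0, stream>>>(src, d_hist);
}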

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • [x] I agree to contribute to the project under Apache 2 License.
  • [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • [ ] The PR is proposed to the proper branch
  • [ ] There is a reference to the original bug report and related work
  • [ ] There is accuracy test, performance test and test data in opencv_extra repository, if applicable. Patch to opencv_extra has the same branch name.
  • [ ] The feature is well documented and sample code can be built with the project CMake

troelsy avatar May 22 '25 11:05 troelsy

cc @cudawarped

asmorkalov avatar May 26 '25 07:05 asmorkalov

@troelsy Your test currently doesn't use the Otsu method, and when it does it fails the assert. I can't review the PR until this is fixed.

How does the performance of this compare to the CPU version?

cudawarped avatar May 27 '25 10:05 cudawarped

I'm sorry, I didn't realize I hadn't run the test correctly. It should be resolved now; please have another look.

There was a numerical error when computing the variance, which could make the resulting threshold off by one. Both implementations should get the same result now.
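
For reference, the standard textbook formulation of Otsu's method (not necessarily the exact expression used in the kernel) picks the threshold that maximizes the between-class variance, so small numerical differences in the variance term can shift the argmax by one bin:

% omega_0, omega_1: class probabilities; mu_0, mu_1: class means below/above threshold t
\sigma_B^2(t) = \omega_0(t)\,\omega_1(t)\,\bigl(\mu_0(t) - \mu_1(t)\bigr)^2,
\qquad
t^{*} = \arg\max_{0 \le t < 256} \sigma_B^2(t)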

I'm testing on an AWS g5.xlarge instance with a 1920x1080 GpuMat of CV_8UC1 and get the following timings (average of 10,000 iterations):

Otsu threshold CPU:   2371.643000µs
Otsu threshold GPU:   68.385000µs
Binary threshold GPU: 5.330000µs

So the current GPU version is about 35x faster than the CPU on this instance type (which is biased towards GPU). The Otsu computation adds about 63µs to the execution compared to a static binary threshold.
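
For anyone wanting to reproduce the numbers, a minimal timing harness along the following lines should work. Note this is a sketch rather than the exact benchmark used above, and passing THRESH_OTSU to cv::cuda::threshold is the interface this PR proposes, not an existing API.

#include <cstdio>
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/cudaarithm.hpp>

int main()
{
    const int N = 10000;
    cv::Mat src(1080, 1920, CV_8UC1), dstCpu;
    cv::randu(src, 0, 256);
    cv::cuda::GpuMat d_src(src), d_dst;

    cv::TickMeter tm;
    tm.start();
    for (int i = 0; i < N; ++i)
        cv::threshold(src, dstCpu, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);
    tm.stop();
    std::printf("Otsu threshold CPU: %f us\n", tm.getTimeMicro() / N);

    tm.reset();
    tm.start();
    for (int i = 0; i < N; ++i)
        cv::cuda::threshold(d_src, d_dst, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU); // proposed API
    tm.stop();
    std::printf("Otsu threshold GPU: %f us\n", tm.getTimeMicro() / N);
    return 0;
}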

From my measurements, about half of the time is spent on allocating and deallocating memory, so if there is a way to reuse the allocations we could reduce the execution time significantly.

These are the individual kernel execution timings reported by Nsight Compute:

Kernel name            Minimum    Maximum    Average
histogram_kernel       4.86µs     4.90µs     4.88µs
otsu_mean              4.67µs     4.83µs     4.75µs
otsu_variance          5.63µs     5.66µs     5.65µs
otsu_score             4.26µs     4.29µs     4.27µs
cv::transformSmart     3.23µs     3.23µs     3.23µs

See also the screenshot of Nsight Systems for a visualization of the execution timeline. Mind that it says ~80µs, but that is due to profiling overhead.

troelsy avatar May 27 '25 15:05 troelsy

See also the screenshot of Nsight Systems for a visualization of the execution timeline. Mind that it says ~80µs, but that is due to profiling overhead.

You're measuring the API calls there, not the execution time of the kernels/memory allocations etc. As you are including synchronization the two should be about the same here, but in general they are not the same thing. If there wasn't any synchronization, the time for the API calls would be much smaller than the execution time.
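
To make the distinction concrete, here is a rough sketch (using the existing THRESH_BINARY path, since it doesn't matter which kernel runs) of measuring the host-side API time and the device-side execution time separately with cv::cuda::Event; with an asynchronous stream the former is typically far smaller than the latter.

#include <cstdio>
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/cudaarithm.hpp>

int main()
{
    cv::cuda::GpuMat src(1080, 1920, CV_8UC1), dst;
    cv::cuda::Stream stream;                      // non-default stream: calls return asynchronously
    cv::cuda::Event start, stop;

    cv::TickMeter tm;
    tm.start();
    start.record(stream);
    cv::cuda::threshold(src, dst, 128, 255, cv::THRESH_BINARY, stream); // returns almost immediately
    stop.record(stream);
    tm.stop();                                    // host-side API time only

    stop.waitForCompletion();                     // now the device work has actually finished
    std::printf("API calls: %.1f us, device execution: %.1f us\n",
                tm.getTimeMicro(),
                cv::cuda::Event::elapsedTime(start, stop) * 1000.0); // ms -> us
    return 0;
}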

From my measurements, about half of the time is spent on allocating and deallocating memory, so if there is a way to reuse the allocations we could reduce the execution time significantly.

You should be able to reduce the allocations/deallocations by using BufferPool. The other main bottleneck is the copy of the threshold from device to host, but because the NPP routines use a host threshold there isn't anything you can do about this.
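
A rough sketch of the BufferPool pattern for reference; the buffer shapes and types below are placeholders for whatever temporaries the Otsu kernels actually need, not the PR's real buffers.

#include <opencv2/core/cuda.hpp>

int main()
{
    // Must be enabled before any Stream is created for pooling to take effect
    cv::cuda::setBufferPoolUsage(true);
    // Optional: size the per-stream stacks (here 8 MB x 2 stacks on the current device)
    cv::cuda::setBufferPoolConfig(cv::cuda::getDevice(), 8 * 1024 * 1024, 2);

    cv::cuda::Stream stream;
    cv::cuda::BufferPool pool(stream);

    // Temporaries come from the pool instead of hitting cudaMalloc/cudaFree on every call
    cv::cuda::GpuMat hist   = pool.getBuffer(1, 256, CV_32SC1);
    cv::cuda::GpuMat scores = pool.getBuffer(1, 256, CV_32FC1);
    return 0;
}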

cudawarped avatar May 28 '25 08:05 cudawarped

@cudawarped What's the next step? The test fails, but it says (409) Conflict: an artifact with this name already exists on the workflow run, so I'm guessing it doesn't have anything to do with my changes.

Should I do something about the histogram implementation or should we let it be for now?

troelsy avatar Jun 10 '25 06:06 troelsy

Please ignore the error. Looks like some bug in our CI scripts. I'll take a look.

asmorkalov avatar Jun 10 '25 07:06 asmorkalov