opencv_contrib Morphology operation is slower on GPU than same operation on CPU

System information

OpenCV => 4.8.0
Operating System / Platform => Nvidia JetPack 6.0 (Ubuntu 22.04) / Nvidia Jetson Orin NX
Compiler => gcc 11.4.0

Detailed description

I am testing morphology operations on the GPU. The test platform is an Nvidia Jetson Orin NX with JetPack 6.0 (R36.3.0, Ubuntu 22.04) and CUDA 12.2 installed. I downloaded, built and installed OpenCV 4.8.0 and opencv_contrib 4.8 on it, then wrote a test program that calls the morphology interfaces on the GPU (cuda::MorphologyFilter->apply) and on the CPU (morphologyEx) to compare their performance. I expected the same operation to be faster on the GPU than on the CPU, but in my test it is slower on the GPU. The source of the test program and the related image file are attached as a zip file for your reference. Could you please help me check whether there is a problem or bug in my test program?

Steps to reproduce

1. Compile the program with the command "g++ -D PROFILE_SAMPLE -o morph morph.cpp $(pkg-config --cflags --libs opencv4) -I /usr/local/cuda/include -L /usr/local/cuda/lib64 -l cudart". An executable file "morph" is generated if no error occurs.
2. Run the program morph built in step 1 without any arguments. It reads the image file "./baboon.jpg", performs the morphology operations on the GPU and on the CPU, measures the time each takes, and prints the results to the terminal.
3. On my test platform (Jetson Orin NX) the output is listed below:

(base) nvidia@ubuntu:~/opencv/rep$ ./morph
264 times morphology operations are done with OpenCV on GPU!
Time for morphology : 4396.36 ms in system time : 4397 ms
264 times morphology operations are done with OpenCV on CPU!
Time for morphology : 1867.39 ms in system time : 1868 ms

The 264 morphology operations take 4396.36 ms on the GPU and 1867.39 ms on the CPU.
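To put these totals on a per-call basis, here is a minimal sketch. The helper names `per_op_ms` and `slowdown` are hypothetical and are not part of the attached program; they only divide the reported totals by the iteration count and take the GPU/CPU ratio.

```cpp
#include <cassert>

// Average time of one operation, given the total time for `iterations` runs.
double per_op_ms(double total_ms, int iterations) {
    assert(iterations > 0);
    return total_ms / iterations;
}

// How many times slower the GPU run is than the CPU run
// (a value above 1 means the GPU is slower).
double slowdown(double gpu_total_ms, double cpu_total_ms) {
    assert(cpu_total_ms > 0.0);
    return gpu_total_ms / cpu_total_ms;
}
```

With the figures above, `per_op_ms(4396.36, 264)` is roughly 16.7 ms per GPU call, `per_op_ms(1867.39, 264)` is roughly 7.1 ms per CPU call, and `slowdown(4396.36, 1867.39)` is about 2.35.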

Source code and related image file

morph.zip

Sep 23 '24 10:09 cg3dland

How does the timing scale with kernel size? Based on your observation, and without checking the implementation, I would expect the smaller kernels to be faster because they can use shared memory, and the larger ones to be slower because they fall back to global memory.

I would also use CUDA streams if you are timing n iterations, because your timing includes device stalls due to the internal use of cudaDeviceSynchronize(). That said, this alone is unlikely to make your GPU code run faster than your CPU code on your device.

Edit: You're not just timing the execution of the morphology operation inside transform_gpu (shown below). You are also timing the creation of element on the host, and of openFilter and dst on the device. In a real application you would initialize these at the start, because doing so on the GPU is very costly. You therefore need to remove them from your timing to get an accurate time cost.

cuda::GpuMat App::transform_gpu(cuda::GpuMat src, int morph_elem, int morph_operator, int kernel_size, cuda::Stream stream)
{
    cuda::GpuMat dst;

    //printf("elem: %d, opr: %d, kernel_size: %d\n", morph_elem, morph_operator, kernel_size);
    Mat element = getStructuringElement(morph_elem, Size(kernel_size * 2 + 1, kernel_size * 2 + 1), Point(kernel_size, kernel_size));

    Ptr<cuda::Filter> openFilter = cuda::createMorphologyFilter(morph_operator, src.type(), element);
    openFilter->apply(src, dst, stream);

    return dst;
}

Ideally you also want to have a few untimed warm up runs before you start timing to ensure the context has been created and the device code loaded.
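The advice above (set up once, warm up, then time only the operation) can be sketched with a small standard-library harness. This is an illustration under assumptions, not code from the attached program: `time_after_warmup` is a hypothetical helper, and the callable `op` stands in for the real GPU or CPU call.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Runs `op` `warmup` times untimed, then times `iters` further runs.
// Returns the total wall-clock time of the timed runs in milliseconds.
// NOTE: `op` is a stand-in. For GPU work it must block until the device
// has finished (for example by synchronizing the stream it used),
// otherwise only the launch overhead is measured.
double time_after_warmup(const std::function<void()>& op, int warmup, int iters) {
    assert(warmup >= 0 && iters > 0);
    for (int i = 0; i < warmup; ++i) op();   // untimed: context creation, code loading
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) op();    // timed region: only the operation itself
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

In the test program, `op` would wrap only `openFilter->apply(src, dst, stream)` followed by `stream.waitForCompletion()`, with `element`, `openFilter` and `dst` created before the harness is called.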

Sep 23 '24 10:09 cudawarped

I am sorry, I do not fully understand. I would like to double check the following with you:

1. Kernel size: this is the kernel_size parameter passed to getStructuringElement(), not the kernel size of the OS, am I right? How should I calculate the kernel_size passed to getStructuringElement() from the source image size?
2. I should time only the call to openFilter->apply() in my test program; the element, openFilter and dst used in transform_gpu() should be initialized before transform_gpu() is called. In my test program I have to create multiple elements and openFilters for the different parameters, which will cost more GPU memory.
3. Can I expect to benefit from the GPU when running morphology operations on multiple images with the same morphology parameters? If I have to run morphology operations on multiple images with different morphology parameters, how can I benefit from the GPU in that case? By using cuda::Stream to do asynchronous operations?

Sep 24 '24 07:09 cg3dland

> 1. Kernel size: this is the kernel_size parameter passed to getStructuringElement(), not the kernel size of the OS, am I right? How should I calculate the kernel_size passed to getStructuringElement() from the source image size?

I was referring to the kernel_size parameter.

> 2. I should time only the call to openFilter->apply() in my test program; the element, openFilter and dst used in transform_gpu() should be initialized before transform_gpu() is called.

Correct.

> 3. Can I expect to benefit from the GPU when running morphology operations on multiple images with the same morphology parameters? If I have to run morphology operations on multiple images with different morphology parameters, how can I benefit from the GPU in that case? By using cuda::Stream to do asynchronous operations?

To get maximum benefit from the GPU you want to initialize your filter and your source and destination arrays on the GPU at the beginning of your program. You may not get any benefit from using the GPU if you are only performing one operation per image. A lot will depend on how many images there are, how many operations you run per image and what you want to do with the destination image; essentially, what your program is trying to achieve.

Anyway this is not an issue and should be redirected to https://forum.opencv.org/.

Sep 24 '24 08:09 cudawarped

I made some updates to the test program:

1. dst, element and openFilter are initialized only once, at the beginning of the function run();
2. openFilter->apply() and morphologyEx() are called in two separate loops, and the time cost of each loop is measured.

Please refer to the attached program. morph.zip

In this test, openFilter->apply() (GPU) still takes more time than morphologyEx() (CPU). What is the reason for this result (the GPU costing more time than the CPU)?

The output of the test program:

264 times morphology operations are done with OpenCV on GPU!
Time for morphology : 1918.12 ms in system time : 1919 ms
264 times morphology operations are done with OpenCV on CPU!
Time for morphology : 330.384 ms in system time : 331 ms

BTW, the time cost declines noticeably when kernel_size is set to a small value, but my issue is that the time cost on the GPU is higher than on the CPU; this is why I logged this issue report.

Sep 24 '24 12:09 cg3dland

> In this test, openFilter->apply() (GPU) still takes more time than morphologyEx() (CPU). What is the reason for this result (the GPU costing more time than the CPU)?

There are two issues at play here:

1. For larger filter sizes the CPU version is faster than the GPU one. On my system this is true for anything greater than 3x3. OpenCV uses Nvidia NPP under the hood to perform these operations, so that is down to their implementation, which may or may not be optimal.
2. OpenCV uses the old NPP streaming API, which adds unnecessary synchronization. This means that for a 3x3 filter on your system the GPU version will most likely be slower than the CPU one anyway.

Sep 24 '24 15:09 cudawarped

Does this apply to all OpenCV filter operations and other image transformations (e.g. threshold), i.e. is the CPU version faster than the GPU one for larger filter sizes in general?

Sep 25 '24 09:09 cg3dland

> Does this apply to all OpenCV filter operations and other image transformations (e.g. threshold), i.e. is the CPU version faster than the GPU one for larger filter sizes in general?

That would depend on the CPU and GPU combination but my guess based on your test results would be that this is true.

Sep 25 '24 10:09 cudawarped

Is it possible to set the filter size? If so, how? (I could not find a way to set the filter size with the parameters of the morphology interface.)

Sep 25 '24 11:09 cg3dland

> Is it possible to set the filter size? If so, how? (I could not find a way to set the filter size with the parameters of the morphology interface.)

filter width/height = kernel_size * 2 + 1
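As a small worked example of this mapping, the sketch below uses two hypothetical helpers (not OpenCV functions). `filter_side` mirrors the `Size(kernel_size * 2 + 1, kernel_size * 2 + 1)` expression from the test program, and `taps_per_pixel` assumes a naive rectangular filter that visits every pixel under the element; NPP's actual implementation may do less work than that.

```cpp
#include <cassert>

// Side length of the square structuring element built from kernel_size,
// matching Size(kernel_size * 2 + 1, kernel_size * 2 + 1) in the test program.
int filter_side(int kernel_size) {
    assert(kernel_size >= 0);
    return kernel_size * 2 + 1;
}

// Number of neighbourhood pixels a naive rectangular filter visits per
// output pixel (an assumption for illustration, not a statement about NPP).
int taps_per_pixel(int kernel_size) {
    const int side = filter_side(kernel_size);
    return side * side;
}
```

So `kernel_size` values of 0, 1 and 2 give 1x1, 3x3 and 5x5 elements, that is 1, 9 and 25 taps per pixel under the naive assumption.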

Sep 25 '24 11:09 cudawarped

Some test results:

kernel_size = 0:
264 times morphology operations are done with OpenCV on GPU!
Time for morphology : 92.4097 ms in system time : 94 ms
264 times morphology operations are done with OpenCV on CPU!
Time for morphology : 20.5716 ms in system time : 20 ms

kernel_size = 1:
264 times morphology operations are done with OpenCV on GPU!
Time for morphology : 114.533 ms in system time : 115 ms
264 times morphology operations are done with OpenCV on CPU!
Time for morphology : 63.5391 ms in system time : 63 ms

kernel_size = 2:
264 times morphology operations are done with OpenCV on GPU!
Time for morphology : 262.029 ms in system time : 263 ms
264 times morphology operations are done with OpenCV on CPU!
Time for morphology : 125.501 ms in system time : 126 ms

This shows that the GPU is always slower than the CPU for the morphology operation. I wonder whether the filter size is set by the 2nd parameter (ksize) of cv::getStructuringElement().

Sep 25 '24 11:09 cg3dland

> kernel_size = 0:

Isn't this just a copy?
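To see why a 1x1 element reduces to a copy for erosion (dilation follows by the same argument with a maximum), here is a naive CPU sketch. `erode_rect` is a hypothetical helper written for illustration only; it is not how OpenCV or NPP implement the filter.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Naive erosion of a row-major 8-bit image with a (2k+1)x(2k+1) rectangular
// element. Pixels outside the image are ignored. Illustrative only.
std::vector<unsigned char> erode_rect(const std::vector<unsigned char>& src,
                                      int width, int height, int kernel_size) {
    assert(width > 0 && height > 0 && kernel_size >= 0);
    assert(src.size() == static_cast<std::size_t>(width) * height);
    std::vector<unsigned char> dst(src.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            unsigned char m = 255;  // identity value for the minimum
            for (int dy = -kernel_size; dy <= kernel_size; ++dy) {
                for (int dx = -kernel_size; dx <= kernel_size; ++dx) {
                    const int yy = y + dy, xx = x + dx;
                    if (yy < 0 || yy >= height || xx < 0 || xx >= width) continue;
                    m = std::min(m, src[static_cast<std::size_t>(yy) * width + xx]);
                }
            }
            dst[static_cast<std::size_t>(y) * width + x] = m;
        }
    }
    return dst;
}
```

With `kernel_size = 0` both inner loops visit only the pixel itself, so the output equals the input and the measured time is essentially that of a copy.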

> kernel_size = 1:

As I said, this will depend on the CPU/GPU you are comparing, and

> OpenCV uses the old NPP streaming API, which adds unnecessary synchronization. This means that for a 3x3 filter on your system the GPU version will most likely be slower than the CPU one anyway.

I should also say that it will depend on the size of the image being processed, see my results below:

Original Image Size (512x512), `kernel_size = 1`

264 times morphology operations are done with OpenCV on RTX 3070 Ti! Removing old NPP API synchronization
Time for morphology : 5.04144 ms in system time : 5 ms
264 times morphology operations are done with OpenCV on RTX 3070 Ti!
Time for morphology : 17.3483 ms in system time : 18 ms
264 times morphology operations are done with OpenCV on i7-12700H!
Time for morphology : 0.003072 ms in system time : 7 ms

Image Size (1024x1024), `kernel_size = 1` (`cv::resize(img_, img, img_.size() * 2)`)

264 times morphology operations are done with OpenCV on RTX 3070 Ti! Removing old NPP API synchronization
Time for morphology : 11.3089 ms in system time : 12 ms
264 times morphology operations are done with OpenCV on RTX 3070 Ti!
Time for morphology : 32.2662 ms in system time : 33 ms
264 times morphology operations are done with OpenCV on i7-12700H!
Time for morphology : 0.002048 ms in system time : 30 ms

Image Size (2048x2048), `kernel_size = 1` (`cv::resize(img_, img, img_.size() * 4)`)

`kernel_size = 1`

264 times morphology operations are done with OpenCV on RTX 3070 Ti! Removing old NPP API synchronization
Time for morphology : 30.5614 ms in system time : 31 ms
264 times morphology operations are done with OpenCV on RTX 3070 Ti!
Time for morphology : 49.2371 ms in system time : 51 ms
264 times morphology operations are done with OpenCV on i7-12700H!
Time for morphology : 0.002048 ms in system time : 99 ms

`kernel_size = 2`

264 times morphology operations are done with OpenCV on RTX 3070 Ti! Removing old NPP API synchronization
Time for morphology : 153.737 ms in system time : 155 ms
264 times morphology operations are done with OpenCV on RTX 3070 Ti!
Time for morphology : 220.989 ms in system time : 221 ms
264 times morphology operations are done with OpenCV on i7-12700H!
Time for morphology : 0.00304 ms in system time : 203 ms

Sep 25 '24 15:09 cudawarped

Thank you for sharing your test results.

Could you also share the CPU time costs on your side, for comparison with the GPU?

Thanks,

Sep 26 '24 11:09 cg3dland

> Could you also share the CPU time costs on your side, for comparison with the GPU?

They're included, my CPU is an i7-12700H.

Sep 26 '24 12:09 cudawarped

I see from your test results that OpenCV's GPU performance clearly benefits from removing the old NPP API synchronization. May I ask why the current OpenCV implementation uses the old NPP streaming API?

Sep 27 '24 12:09 cg3dland

> why the current OpenCV implementation uses the old NPP streaming API?

Nobody has put in a PR yet to update this. I may update it if/when versions of CUDA which don't support the new API are no longer supported by OpenCV but not before as I don't want to add macros to support two versions.

Sep 27 '24 15:09 cudawarped
