[QST] Is it possible to exchange data on GPU with OpenCV CUDA?
Is it possible to exchange data on GPU with OpenCV CUDA?
In other words, I want to perform operations with OpenCV on the GPU with CUDA, and afterwards (instead of downloading the data back from the GPU) I want to operate directly on that GPU memory. Some conversion along the way would not be a problem; I just want to keep the data in GPU memory.
Even a hacky solution would be fine.
Michel
Hi @Michelvl92 , Thank you for your interest and the question!
cuCIM's image processing part is based on CuPy, and CuPy's array object supports both __cuda_array_interface__ and DLPack.
CuPy has a document on interoperability with other frameworks, including PyTorch and Numba: https://docs.cupy.dev/en/stable/user_guide/interoperability.html
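For example, a minimal sketch following that guide (not from the original thread; assumes a CUDA-enabled PyTorch build):
import cupy as cp
import torch

x_cp = cp.arange(6, dtype=cp.float32).reshape(2, 3)
x_pt = torch.as_tensor(x_cp, device='cuda')  # zero-copy via __cuda_array_interface__
x_back = cp.asarray(x_pt)                    # back to CuPy, also without a copy
print(x_cp.data.ptr, x_pt.data_ptr(), x_back.data.ptr)  # all print the same device pointer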
Apparently, OpenCV CUDA supports neither __cuda_array_interface__ nor DLPack.
There is, however, a blog article about integrating Python OpenCV CUDA with other frameworks, including CuPy:
- https://www.simonwenkel.com/notes/software_libraries/opencv/opencv-cuda-integration.html
With an OpenCV version that includes https://github.com/opencv/opencv/pull/16513 (merged to the master branch on March 5, 2020), I think you should be able to convert OpenCV's object to a CuPy array and use cuCIM's scikit-image API with CUDA:
import cv2
import numpy as np
import cupy as cp
from cucim.skimage import color

class CudaArrayInterface:
    def __init__(self, gpu_mat):
        w, h = gpu_mat.size()
        type_map = {
            cv2.CV_8U: "u1", cv2.CV_8S: "i1",
            cv2.CV_16U: "u2", cv2.CV_16S: "i2",
            cv2.CV_32S: "i4",
            cv2.CV_32F: "f4", cv2.CV_64F: "f8",
        }
        self.__cuda_array_interface__ = {
            "version": 2,
            "shape": (h, w),
            "data": (gpu_mat.cudaPtr(), False),
            "typestr": type_map[gpu_mat.type()],
            "strides": (gpu_mat.step, gpu_mat.elemSize()),
        }

# Create GPU array with OpenCV
data_gpu_cv = cv2.cuda_GpuMat()
data_gpu_cv.upload(np.eye(64, dtype=np.float32))

# Modify the same GPU array with CuPy (zero copy; the update is visible through the GpuMat too)
data_gpu_cp = cp.asarray(CudaArrayInterface(data_gpu_cv))
data_gpu_cp *= 42.0

## Use cuCIM's image processing (rgb2hed expects an RGB image, not this 2-D example)
# ihc_hed = color.rgb2hed(data_gpu_cp)

# Download and verify
assert np.allclose(data_gpu_cp.get(), np.eye(64) * 42.0)
Please let us know if you have any questions. Thanks!
Thank you for your example!
Unfortunately, it is not working.
My array is a cv2.CV_8UC3 (see https://gist.github.com/yangcha/38f2fa630e223a8546f9b48ebbb3e61a)
Therefore I have set it as follows: "typestr": "u1"
But I am getting the following error:
dif_bin_cp_d = cp.asarray(CudaArrayInterface(dif_bin_d))
File "/usr/local/lib/python3.8/dist-packages/cupy/_creation/from_data.py", line 76, in asarray
return _core.array(a, dtype, False, order)
File "cupy/_core/core.pyx", line 2266, in cupy._core.core.array
File "cupy/_core/core.pyx", line 2290, in cupy._core.core.array
File "cupy/_core/core.pyx", line 2418, in cupy._core.core._array_default
ValueError: Unsupported dtype object
I am not sure what is wrong here.
Hi @Michelvl92,
Please try to understand __cuda_array_interface__ (https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html) and create the interface using what cv2.cuda_GpuMat provides (https://docs.opencv.org/4.6.0/d0/d60/classcv_1_1cuda_1_1GpuMat.html#ab02f97698d8272f0d253f3029329ed10).
In the example below, I assume a 4 x 3 matrix (cv2.cuda_GpuMat((3, 4), cv2.CV_8UC3)) as input and convert the matrix to a CuPy array without copying.
You can update type_map and generalize the class for other multi-channel OpenCV image types.
import cupy as cp
import numpy as np
import cv2
# Create 4 x 3 matrix
np_arr = np.array([[[ 1, 1, 1], [ 2, 2, 2], [ 3, 3, 3]],
                   [[ 4, 4, 4], [ 5, 5, 5], [ 6, 6, 6]],
                   [[ 7, 7, 7], [ 8, 8, 8], [ 9, 9, 9]],
                   [[10, 10, 10], [11, 11, 11], [12, 12, 12]]], dtype=np.uint8)
np_arr
# array([[[ 1, 1, 1],
# [ 2, 2, 2],
# [ 3, 3, 3]],
#
# [[ 4, 4, 4],
# [ 5, 5, 5],
# [ 6, 6, 6]],
#
# [[ 7, 7, 7],
# [ 8, 8, 8],
# [ 9, 9, 9]],
#
# [[10, 10, 10],
# [11, 11, 11],
# [12, 12, 12]]], dtype=uint8)
# The following is not necessary; it just shows what CuPy's __cuda_array_interface__ looks like (strides is None, which means contiguous memory)
cp_arr = cp.asarray(np_arr)
print(cp_arr.__cuda_array_interface__)
# {'shape': (4, 3, 3), 'typestr': '|u1', 'descr': [('', '|u1')], 'stream': 1, 'version': 3, 'strides': None, 'data': (140668563357696, False)}
# Create 4 x 3 (width:3, height:4) GpuMat => (3, 4) for the first parameter
cv2_arr = cv2.cuda_GpuMat((3, 4), cv2.CV_8UC3)
cv2_arr.upload(np_arr)
class CudaArrayInterface:
    def __init__(self, gpu_mat):
        w, h = gpu_mat.size()
        type_map = {
            cv2.CV_8U: "|u1",
            cv2.CV_8UC1: "|u1",
            cv2.CV_8UC2: "|u1",
            cv2.CV_8UC3: "|u1",
            cv2.CV_8UC4: "|u1",
            cv2.CV_8S: "|i1",
            cv2.CV_16U: "<u2", cv2.CV_16S: "<i2",
            cv2.CV_32S: "<i4",
            cv2.CV_32F: "<f4", cv2.CV_64F: "<f8",
        }
        self.__cuda_array_interface__ = {
            "version": 3,
            "shape": (h, w, gpu_mat.channels()),
            "typestr": type_map[gpu_mat.type()],
            "descr": [("", type_map[gpu_mat.type()])],
            "stream": 1,
            "strides": (gpu_mat.step, gpu_mat.elemSize(), gpu_mat.elemSize1()),
            "data": (gpu_mat.cudaPtr(), False),
        }
# This __cuda_array_interface__'s strides is not None, which means non-contiguous memory
cuda_interface = CudaArrayInterface(cv2_arr)
print(cuda_interface.__cuda_array_interface__)
#{'version': 3, 'shape': (4, 3, 3), 'typestr': '|u1', 'descr': [('', '|u1')], 'stream': 1, 'strides': (512, 3, 1), 'data': (140668563358208, False)}
gpu_cp = cp.asarray(cuda_interface)
# shows the same data pointer
print(gpu_cp.__cuda_array_interface__)
#{'shape': (4, 3, 3), 'typestr': '|u1', 'descr': [('', '|u1')], 'stream': 1, 'version': 3, 'strides': (512, 3, 1), 'data': (140668563358208, False)}
# same data
gpu_cp
# array([[[ 1, 1, 1],
# [ 2, 2, 2],
# [ 3, 3, 3]],
#
# [[ 4, 4, 4],
# [ 5, 5, 5],
# [ 6, 6, 6]],
#
# [[ 7, 7, 7],
# [ 8, 8, 8],
# [ 9, 9, 9]],
#
# [[10, 10, 10],
# [11, 11, 11],
# [12, 12, 12]]], dtype=uint8)
# Some image processing algorithms require (or assume) contiguous memory. In that case, you can copy the non-contiguous memory to contiguous memory
contiguous_cp = cp.ascontiguousarray(gpu_cp)
print(contiguous_cp.__cuda_array_interface__)
# {'shape': (4, 3, 3), 'typestr': '|u1', 'descr': [('', '|u1')], 'stream': 1, 'version': 3, 'strides': None, 'data': (140668563360256, False)}
# now strides is None.
contiguous_cp
# array([[[ 1, 1, 1],
# [ 2, 2, 2],
# [ 3, 3, 3]],
#
# [[ 4, 4, 4],
# [ 5, 5, 5],
# [ 6, 6, 6]],
#
# [[ 7, 7, 7],
# [ 8, 8, 8],
# [ 9, 9, 9]],
#
# [[10, 10, 10],
# [11, 11, 11],
# [12, 12, 12]]], dtype=uint8)
This conversation explains how the conversion from cuda_GpuMat to a CuPy array should look, but do you have any idea how to do it the other way around, i.e. from a CuPy array to a cuda_GpuMat? My simplified use case would be:
1. capture an image,
2. upload it to a cuda_GpuMat,
3. convert the color scheme with cvtColor,
4. crop the image with CuPy,
5. do some more operations using cuda_GpuMat,
6. do some TensorFlow operations (the transformation from CuPy to TF is straightforward),
and I would like to do all of that without downloading/uploading the image between steps 4 and 5.
@Darnok99
To my knowledge it is not possible to convert a CuPy array back to a cv2.cuda_GpuMat when using Python (it is not an issue when using C++). A general workaround is to allocate another cv2.cuda_GpuMat of known size, wrap another __cuda_array_interface__ around it, and copy the data back.
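A minimal sketch of that copy-back workaround, reusing the CudaArrayInterface wrapper from the first example in this thread (untested here; assumes a single-channel float32 matrix):
import cv2
import cupy as cp

src_cp = cp.ones((4, 3), dtype=cp.float32)  # the CuPy data we want inside a GpuMat

# allocate a destination GpuMat of known size; GpuMat size is (width, height)
dst_gpu = cv2.cuda_GpuMat(src_cp.shape[::-1], cv2.CV_32F)

# wrap the GpuMat's memory as a CuPy view and copy the data back (device to device)
dst_view = cp.asarray(CudaArrayInterface(dst_gpu))
dst_view[:] = src_cp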
In your specific case, you could simply skip the upload/download between steps 4 and 5, assuming you really only do the cropping in step 4, by moving the crop up to step 2: a crop is basically a memory view that can be applied before allocating and uploading the first cv2.cuda_GpuMat.
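A rough sketch of that reordering (the frame shape and crop coordinates below are made up for illustration):
import cv2
import numpy as np

frame = np.zeros((720, 1280, 3), dtype=np.uint8)  # stand-in for the captured image

# step 4 moved up front: cropping is a cheap NumPy view on the host
crop = np.ascontiguousarray(frame[100:400, 200:600])  # upload wants contiguous memory

# steps 2-3: upload once, then convert the color scheme on the GPU
gpu = cv2.cuda_GpuMat()
gpu.upload(crop)
gpu = cv2.cuda.cvtColor(gpu, cv2.COLOR_BGR2RGB)

# steps 5-6: further cv2.cuda operations on `gpu`, then wrap it as a CuPy array
# (see CudaArrayInterface above) and hand it to TensorFlow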
I hope that this answer is helpful.
@swenkel I am attempting this procedure as well, I feel adding a second cuda_array_interface would likely be the most straight forward method; however, how would one go about the type mapping?
type_map = {
    cv2.CV_8U: "|u1",    # <----
    cv2.CV_8UC1: "|u1",  # <----
    cv2.CV_8UC2: "|u1",  # <----
    cv2.CV_8UC3: "|u1",  # <----
    cv2.CV_8UC4: "|u1",  # <----
    cv2.CV_8S: "|i1",
    cv2.CV_16U: "<u2", cv2.CV_16S: "<i2",
    cv2.CV_32S: "<i4",
    cv2.CV_32F: "<f4", cv2.CV_64F: "<f8",
}
This doesn't look reversible.
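For what it's worth, the map is many-to-one only at the typestr level; the OpenCV type can be rebuilt from the dtype plus the channel count, as a quick check shows (the CV_MAKETYPE arithmetic below also appears in a later answer):
import cv2

# several OpenCV types collapse onto the same typestr "|u1" ...
print(cv2.CV_8U, cv2.CV_8UC1, cv2.CV_8UC3)  # 0 0 16

# ... but depth plus channel count recovers the full type:
# CV_MAKETYPE(depth, channels) == (depth & 7) + ((channels - 1) << 3)
depth, channels = cv2.CV_8U, 3
print((depth & 7) + ((channels - 1) << 3) == cv2.CV_8UC3)  # True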
At some point in early 2020 I wrote the initial blog post that is linked in this answer above (https://github.com/rapidsai/cucim/issues/329#issuecomment-1179327917).
Edited (14 h later): I re-tested a couple of things recently, and with OpenCV 4.7.0 and CuPy 11.3.0 the method explained above for accessing a cv2.GpuMat from CuPy seemed to create a copy rather than access the pointer as a pure reference. The copy is cheap in compute time, as it happens on the CUDA device, but it is nevertheless a copy and would make the workaround mentioned in an earlier answer impossible. That observation may have been example-specific, though; I can no longer reproduce it.
You are interested in the reverse, which is moving data from a CuPy array to a cv2.GpuMat. No __cuda_array_interface__ can help with that; that said, everything needed does exist and works when programming in C++. Using libtorch it would look as follows:
torch_tensor = torch_tensor.permute({0, 2, 3, 1});
torch_tensor = torch_tensor.squeeze(0);
cv::cuda::GpuMat gFrame_from_tensor(cv::Size(1280,720), CV_32FC3, torch_tensor.data_ptr());
At least when programming in C++, type mapping does not seem to be an issue. If OpenCV's Python API behaved exactly like the C++ API, the following example would work:
import cupy as cp
import cv2
import numpy as np
img = np.zeros((5,5),dtype=np.uint8)
img_cp = cp.asarray(img)
img_cp[0:3,:] = 5
img_cv2_cu = cv2.cuda_GpuMat(img_cp.__cuda_array_interface__['shape'],
                             cv2.CV_8U,  # or cv2.CV_8UC1
                             img_cp.__cuda_array_interface__['data'][0])
print('CuPy Array')
print(img_cp)
print('CuPy CUDA Pointer:', img_cp.__cuda_array_interface__['data'][0])
print()
print('cv2.GpuMat')
print(img_cv2_cu.download())
print(f'cv2.cuda_GpuMat CUDA Pointer: {img_cv2_cu.cudaPtr()}')
But it does not work. It returns:
CuPy Array
[[5 5 5 5 5]
[5 5 5 5 5]
[5 5 5 5 5]
[0 0 0 0 0]
[0 0 0 0 0]]
CuPy CUDA Pointer: 140248967185920
cv2.GpuMat
[[255 255 255 255 255]
[255 255 255 255 255]
[255 255 255 255 255]
[255 255 255 255 255]
[255 255 255 255 255]]
cv2.cuda_GpuMat CUDA Pointer: 140248967188992
I assume that the Python API calls some default allocator and does not use the CUDA pointer. This might be caused by:
- Python internals
- OpenCV's Python API implementation
This is because this functionality is not exposed to Python; see cuda.hpp. When you call
img_cv2_cu = cv2.cuda_GpuMat(img_cp.__cuda_array_interface__['shape'],
                             cv2.CV_8U,  # or cv2.CV_8UC1
                             img_cp.__cuda_array_interface__['data'][0])
you are actually creating a new GpuMat, with the value of img_cp.__cuda_array_interface__['data'][0] resolved against a constructor overload that treats it as an initial fill value rather than a device pointer (hence the matrix of 255s above).
Unfortunately, due to the way the Python bindings are generated and the order in which function parameters are resolved, this functionality cannot be added by simply wrapping the existing methods, which is probably one reason why they have been left unwrapped.
I have added a new cv.cuda.createGpuMatFromCudaMemory method in https://github.com/opencv/opencv/pull/23371 which works for me on the modified example below.
import cupy as cp
import cv2
import numpy as np
img = np.zeros((5,5),dtype=np.uint8)
img_cp = cp.asarray(img)
img_cp[0:3,:] = 5
img_cv2_cu = cv2.cuda.createGpuMatFromCudaMemory(img_cp.__cuda_array_interface__['shape'],
                                                 cv2.CV_8U,  # or cv2.CV_8UC1
                                                 img_cp.__cuda_array_interface__['data'][0])
print('CuPy Array')
print(img_cp)
print('CuPy CUDA Pointer:', img_cp.__cuda_array_interface__['data'][0])
print()
print('cv2.GpuMat')
print(img_cv2_cu.download())
print(f'cv2.cuda_GpuMat CUDA Pointer: {img_cv2_cu.cudaPtr()}')
@swenkel Can you let me know if this fixes the issue for you?
It is quite possible to use a shared memory space to move data bidirectionally between cupy <> torch <> OpenCV <> pycuda; however, it is essential to be aware of the runtime contexts at work in your business logic: the order of imports, loose contexts created by third-party libraries (OpenSSL, for one), and especially graphical rendering through libraries such as OpenGL. That being said, if you want to play with some (mostly) working experiments, see here: https://github.com/manbehindthemadness/blood-magic
Do be aware that this was prototyped on Tegra, so some tweaking might be needed for a non-unified memory layout.
@cudawarped I just tried code from your PR, and it works great! I'm using it with this helper function I wrote:
import cv2
import cupy as cp
def cv_cuda_gpumat_from_cp_array(arr: cp.ndarray) -> cv2.cuda.GpuMat:
    assert len(arr.shape) in (2, 3), "CuPy array must have 2 or 3 dimensions to be a valid GpuMat"
    type_map = {
        cp.dtype('uint8'): cv2.CV_8U,
        cp.dtype('int8'): cv2.CV_8S,
        cp.dtype('uint16'): cv2.CV_16U,
        cp.dtype('int16'): cv2.CV_16S,
        cp.dtype('int32'): cv2.CV_32S,
        cp.dtype('float32'): cv2.CV_32F,
        cp.dtype('float64'): cv2.CV_64F,
    }
    depth = type_map.get(arr.dtype)
    assert depth is not None, "Unsupported CuPy array dtype"
    channels = 1 if len(arr.shape) == 2 else arr.shape[2]
    # equivalent to the unexposed OpenCV C++ macro CV_MAKETYPE(depth, channels):
    # (depth & 7) + ((channels - 1) << 3)
    mat_type = depth + ((channels - 1) << 3)
    # note: this wraps the memory as-is, so `arr` should be C-contiguous
    # (use cp.ascontiguousarray(arr) first if it is not)
    mat = cv2.cuda.createGpuMatFromCudaMemory(
        arr.__cuda_array_interface__['shape'][1::-1],  # GpuMat size is (width, height)
        mat_type,
        arr.__cuda_array_interface__['data'][0])
    return mat
P.S. Updated to arbitrary channel number
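A quick usage sketch (assumes an OpenCV build that includes the new binding):
arr = cp.arange(2 * 3 * 3, dtype=cp.uint8).reshape(2, 3, 3)  # a 3x2 3-channel image
mat = cv_cuda_gpumat_from_cp_array(arr)
print(mat.size(), mat.type())  # (3, 2) 16, i.e. CV_8UC3
print(mat.cudaPtr() == arr.__cuda_array_interface__['data'][0])  # True: same memory, no copy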
I know the question below is very old, but the solution might help someone who found this issue by googling.
@Michelvl92
Unfortunately, it is not working.
My array is a cv2.CV_8UC3 (see https://gist.github.com/yangcha/38f2fa630e223a8546f9b48ebbb3e61a)
Therefore I have set "typestr": "u1", but I am getting the following error:
ValueError: Unsupported dtype object
I am not sure what is wrong here.
Here's an edit to the code by @gigony to work with any channel number:
(the idea is to use gpu_mat.depth() instead of gpu_mat.type())
import cv2
import cupy as cp
def cp_array_from_cv_cuda_gpumat(mat: cv2.cuda.GpuMat) -> cp.ndarray:
    class CudaArrayInterface:
        def __init__(self, gpu_mat: cv2.cuda.GpuMat):
            w, h = gpu_mat.size()
            type_map = {
                cv2.CV_8U: "|u1",
                cv2.CV_8S: "|i1",
                cv2.CV_16U: "<u2", cv2.CV_16S: "<i2",
                cv2.CV_32S: "<i4",
                cv2.CV_32F: "<f4", cv2.CV_64F: "<f8",
            }
            self.__cuda_array_interface__ = {
                "version": 3,
                "shape": (h, w, gpu_mat.channels()) if gpu_mat.channels() > 1 else (h, w),
                "typestr": type_map[gpu_mat.depth()],
                "descr": [("", type_map[gpu_mat.depth()])],
                "stream": 1,
                "strides": (gpu_mat.step, gpu_mat.elemSize(), gpu_mat.elemSize1())
                           if gpu_mat.channels() > 1
                           else (gpu_mat.step, gpu_mat.elemSize()),
                "data": (gpu_mat.cudaPtr(), False),
            }

    arr = cp.asarray(CudaArrayInterface(mat))
    return arr
It also returns a 2-D cupy array for a one-channel GpuMat, just like gpu_mat.download() gives you 2-D numpy arrays in these cases.
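For example, a small check (assumes a CUDA-enabled OpenCV build):
import numpy as np

gpu = cv2.cuda_GpuMat((4, 2), cv2.CV_8UC3)  # width 4, height 2
gpu.upload(np.ones((2, 4, 3), dtype=np.uint8))
arr = cp_array_from_cv_cuda_gpumat(gpu)
print(arr.shape)  # (2, 4, 3)
print(arr.data.ptr == gpu.cudaPtr())  # True: a view, not a copy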
See here for some working examples https://github.com/manbehindthemadness/blood-magic
Take note that new developments from @cudawarped on the cv2 codebase in relation to memory pointer exchange will likely provide more robust options in the near future.
Can this be closed, as https://github.com/opencv/opencv/pull/23371 has now been merged?
If anyone has the latest version of CUDA (12.1) and cuDNN (8.9.1) installed they can test this change with pre-built wheels from https://github.com/cudawarped/opencv-python-cuda-wheels/releases/tag/4.7.0.20230527
Thanks @cudawarped for the update and for getting the new external memory functionality merged in OpenCV. So if we want to point users to this new capability the requirement will be to use OpenCV>=4.8?
That's an interesting question. If this weren't part of the contrib repo I would say yes, wait for the 4.8.0 wheel to be released (normally a few days after the main release). However, to get access to the CUDA Python bindings you have to build OpenCV yourself, so any commit after this was merged (22 May) should be good. For ease, and as a few things have been moved from the contrib to the main repo lately, I would just clone from the tip of the 4.x branches.
And if you have the latest version of CUDA you can try just installing the latest wheel I linked to above.
This is very exciting. I will build against the AGX Orin and put it through its paces during my next round of development.