Adding DtoD support to CUDA shared memory feature
Hello world,
This patch:
- Adds DtoD CUDA memcpy support, which allows passing video frames decoded in CUDA memory to Triton.
- Changes the API of the `set_shared_memory_region` function to accept CUDA device memory besides numpy arrays.
I'm not certain about the changes to the Python module API, so please feel free to give me guidance on how to implement them better.
@rarzumanyan Can you describe how your changes work? For `device=gpu`, what is the `input_values` for `set_shared_memory_region`?
I would suggest adding an example demonstrating the DtoD copies, similar to simple_http_cudasm_client. You can have some transitive flow of data on the GPU. It will be useful for testing and for understanding the API as well.
Hi @tanmayv25
> Can you describe how your changes work?
Sure.
The main change in the patch is the `CudaSharedMemoryRegionSetDptr` function, which behaves very much like `CudaSharedMemoryRegionSet`.
The only difference between them is that `CudaSharedMemoryRegionSet` performs an HtoD CUDA memcpy, while `CudaSharedMemoryRegionSetDptr` performs a DtoD CUDA memcpy.
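For readers unfamiliar with the two copy directions, here is a minimal sketch of the difference using cupy's CUDA runtime bindings. `shm_dptr` merely stands in for the device pointer backing the CUDA shared memory region; this illustrates the copies, not the actual C code in this PR:

```python
# Minimal illustration of the two copy directions, via cupy's CUDA runtime
# bindings. `shm_dptr` stands in for the device pointer of the CUDA shared
# memory region; the real functions live in the client's C library.
import numpy as np
import cupy as cp
from cupy.cuda import runtime

nbytes = 1024
host_src = np.ones(nbytes, dtype=np.uint8)   # source buffer in host RAM
dev_src = cp.ones(nbytes, dtype=cp.uint8)    # source buffer already in vRAM
shm_dptr = runtime.malloc(nbytes)            # stand-in for the shm region

# What CudaSharedMemoryRegionSet does: HtoD copy from a numpy buffer.
runtime.memcpy(shm_dptr, host_src.ctypes.data, nbytes,
               runtime.memcpyHostToDevice)

# What CudaSharedMemoryRegionSetDptr does: DtoD copy from a device pointer.
runtime.memcpy(shm_dptr, dev_src.data.ptr, nbytes,
               runtime.memcpyDeviceToDevice)

runtime.free(shm_dptr)
```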
> For `device=gpu`, what is the `input_values` for `set_shared_memory_region`?
When `device==gpu`, the `set_shared_memory_region` function expects the input pointer to be a memory region in vRAM, allocated with the CUDA driver or runtime API.
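To make that concrete, here is a hedged usage sketch. `cudashm` is `tritonclient.utils.cuda_shared_memory`; passing a cupy array directly as one of the `input_values` is my reading of the proposed API, and the final signature may differ:

```python
# Hedged usage sketch: create a CUDA shared memory region and fill it from
# memory that is already on the GPU. Passing a cupy array (whose buffer
# lives in vRAM) reflects this PR's proposed behavior, not the released API.
import cupy as cp
import tritonclient.utils.cuda_shared_memory as cudashm

frame = cp.zeros((3, 720, 1280), dtype=cp.uint8)  # e.g. a decoded frame in vRAM

shm_handle = cudashm.create_shared_memory_region(
    "input_region", frame.nbytes, 0)              # name, byte_size, device_id

# With device memory as the source, the copy is DtoD instead of HtoD.
cudashm.set_shared_memory_region(shm_handle, [frame])
```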
> I would suggest adding an example demonstrating the DtoD copies
Makes total sense. I just want to make sure you're happy with the proposed API changes first; as soon as that is sorted out, I'll add one more patch with a sample to this PR.
Hi @CoderHam
> Why not use a device flag and condition on that to use `cudaMemcpyDeviceToDevice` or `cudaMemcpyHostToDevice`?
Adding an extra flag is perfectly fine by me. However, I didn't add one for the following reasons:
- I didn't want to contaminate the namespace with yet another enum for device identification and export it to Python. `cudaMemcpy` has an enum that identifies the memcpy direction, but not all of its values are relevant here.
- An integer or boolean flag isn't easy to follow.
That's why I decided to introduce a separate function instead.
If you'd like the `CudaSharedMemoryRegionSet` signature to be extended with a device flag anyway, please let me know and I'll do that.
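For illustration only, here is a hypothetical sketch of the flag-based alternative under discussion; none of these Python names exist in the codebase, and it mainly shows the extra enum that would have to be exported:

```python
# Hypothetical flag-based alternative (NOT the implemented API): one entry
# point whose `kind` flag selects the memcpy direction, instead of the
# separate CudaSharedMemoryRegionSetDptr function this PR adds.
from enum import Enum

class CopyKind(Enum):
    HOST_TO_DEVICE = "htod"
    DEVICE_TO_DEVICE = "dtod"

def set_region(handle, src_ptr, byte_size, kind=CopyKind.HOST_TO_DEVICE):
    if kind is CopyKind.HOST_TO_DEVICE:
        pass  # would call CudaSharedMemoryRegionSet (HtoD memcpy)
    elif kind is CopyKind.DEVICE_TO_DEVICE:
        pass  # would call CudaSharedMemoryRegionSetDptr (DtoD memcpy)
    else:
        raise ValueError(f"unsupported copy kind: {kind}")
```

The `CopyKind` enum exported to Python is exactly the extra namespace surface the author wants to avoid.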
@rarzumanyan On second thought, your arguments for not contaminating the namespace with another enum are valid. You can just make the new changes I suggested for handling the `input_values` field specially for the case where it is a (CUDA) device pointer.
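A hypothetical sketch of that special-casing follows; the helper names are illustrative stand-ins for the underlying C calls, not code from this PR:

```python
# Hypothetical dispatch inside set_shared_memory_region: numpy arrays take
# the existing HtoD path, anything else is treated as CUDA device memory
# and routed to the DtoD path. The _region_set* helpers are stand-ins.
import numpy as np

def _region_set(handle, offset, arr):
    print(f"HtoD copy of {arr.nbytes} bytes at offset {offset}")

def _region_set_dptr(handle, offset, tensor):
    print(f"DtoD copy of {tensor.nbytes} bytes at offset {offset}")

def set_shared_memory_region(cuda_shm_handle, input_values):
    offset = 0
    for tensor in input_values:
        if isinstance(tensor, np.ndarray):
            _region_set(cuda_shm_handle, offset, tensor)       # host source
        else:
            _region_set_dptr(cuda_shm_handle, offset, tensor)  # device source
        offset += tensor.nbytes
```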
@tanmayv25 who would be a good person to task with reviewing the remainder of this code? I'm not familiar with this kind of code.
@jbkyang-nvi, @tanmayv25 has suggested you might be a good person to request a review from for this code. Can you help with that?
@rarzumanyan are you able to add testing for this feature?
Hello, I ran into the same problem. In one of my face recognition projects the data volume is large (batch > 64), and the recognition model depends on the results of the detection model, so I post-process the detection model's outputs (NMS, etc.) on the GPU with cupy/torch. But because the official `cudashm.set_shared_memory_region(shm_input_handle, [img])` only accepts a numpy array for `img`, I have to move the post-processed data back to the CPU. I had been doing that until I came across this PR. I'm a novice in C++ and don't know how to modify the code myself so that the Python tritonclient's `cudashm` supports uploading tensor data that is already on the GPU. Could you tell me how to build the Python module from your branch, so my Python program can send GPU tensors to Triton? Thanks a lot! @rarzumanyan
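For reference, both torch and cupy already expose the raw CUDA device pointer of a GPU-resident tensor, which is the piece a DtoD-capable client needs instead of a numpy array. How the patched `set_shared_memory_region` would consume these pointers is an assumption based on this thread:

```python
# Getting the raw CUDA device pointer from GPU tensors in torch and cupy.
# How the patched set_shared_memory_region consumes these pointers is an
# assumption based on this thread, not a documented API.
import torch
import cupy as cp

t = torch.zeros(64, 512, device="cuda")   # e.g. a batch of face embeddings
a = cp.zeros((64, 512), dtype=cp.float32)

torch_dptr = t.data_ptr()                 # integer address in vRAM
cupy_dptr = a.data.ptr                    # integer address in vRAM

print(hex(torch_dptr), hex(cupy_dptr))
# These device pointers, plus the byte sizes (t.numel() * t.element_size()
# and a.nbytes), are what a DtoD-capable set_shared_memory_region needs
# instead of a numpy array.
```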