Adding DtoD support to CUDA shared memory feature
Hello world,
This patch:
- Adds DtoD CUDA memcpy support, which allows passing video frames decoded in CUDA memory to Triton.
- Changes the API of the `set_shared_memory_region` function to accept CUDA device memory besides numpy arrays.
I'm not certain about the changes to the Python module API, so please feel free to give me guidance on how to implement them better.
@rarzumanyan Can you describe how your changes work? For `device=gpu`, what is the `input_values` for `set_shared_memory_region`?
I would suggest adding an example demonstrating the DtoD copies, similar to simple_http_cudasm_client. You can have some transitive flow of data on the GPU. It will be useful for testing and for understanding the API as well.
Hi @tanmayv25
> Can you describe how your changes work?
Sure.
The main change in the patch is the `CudaSharedMemoryRegionSetDptr` function, which behaves very much like `CudaSharedMemoryRegionSet`.
The only difference between them is that `CudaSharedMemoryRegionSet` performs an HtoD CUDA memcpy, while `CudaSharedMemoryRegionSetDptr` performs a DtoD CUDA memcpy.
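For readers unfamiliar with the two copy directions, here is a minimal sketch of the difference using cupy's CUDA runtime bindings. `shm_dptr` merely stands in for the device pointer backing the CUDA shared memory region; this illustrates the copies, not the actual C code in this PR:

```python
# Minimal illustration of the two copy directions, via cupy's CUDA runtime
# bindings. `shm_dptr` stands in for the device pointer of the CUDA shared
# memory region; the real functions live in the client's C library.
import numpy as np
import cupy as cp
from cupy.cuda import runtime

nbytes = 1024
host_src = np.ones(nbytes, dtype=np.uint8)   # source buffer in host RAM
dev_src = cp.ones(nbytes, dtype=cp.uint8)    # source buffer already in vRAM
shm_dptr = runtime.malloc(nbytes)            # stand-in for the shm region

# What CudaSharedMemoryRegionSet does: HtoD copy from a numpy buffer.
runtime.memcpy(shm_dptr, host_src.ctypes.data, nbytes,
               runtime.memcpyHostToDevice)

# What CudaSharedMemoryRegionSetDptr does: DtoD copy from a device pointer.
runtime.memcpy(shm_dptr, dev_src.data.ptr, nbytes,
               runtime.memcpyDeviceToDevice)

runtime.free(shm_dptr)
```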
> For `device=gpu`, what is the `input_values` for `set_shared_memory_region`?
When `device==gpu`, the `set_shared_memory_region` function expects the input pointer to be a memory region in vRAM, allocated with the CUDA driver or runtime API.
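To make that concrete, here is a hedged usage sketch. `cudashm` is `tritonclient.utils.cuda_shared_memory`; passing a cupy array directly as one of the `input_values` is my reading of the proposed API, and the final signature may differ:

```python
# Hedged usage sketch: create a CUDA shared memory region and fill it from
# memory that is already on the GPU. Passing a cupy array (whose buffer
# lives in vRAM) reflects this PR's proposed behavior, not the released API.
import cupy as cp
import tritonclient.utils.cuda_shared_memory as cudashm

frame = cp.zeros((3, 720, 1280), dtype=cp.uint8)  # e.g. a decoded frame in vRAM

shm_handle = cudashm.create_shared_memory_region(
    "input_region", frame.nbytes, 0)              # name, byte_size, device_id

# With device memory as the source, the copy is DtoD instead of HtoD.
cudashm.set_shared_memory_region(shm_handle, [frame])
```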
> I would suggest adding an example demonstrating the DtoD copies
Makes total sense. I just want to make sure you're happy with the proposed API changes first; as soon as that is sorted out, I'll add one more patch with a sample to this PR.
Hi @CoderHam
> Why not use a device flag and condition on that to use `cudaMemcpyDeviceToDevice` or `cudaMemcpyHostToDevice`?
Adding an extra flag is perfectly fine by me. However, I didn't add one for the following reasons:
- I didn't want to contaminate the namespace with yet another enum for device identification and export it to Python. `cudaMemcpy` has an enum that identifies the memcpy direction, but not all of its values are relevant here.
- An integer or boolean flag isn't easy to follow.
That's why I decided to introduce a separate function instead.
If you'd like the `CudaSharedMemoryRegionSet` signature to be extended with a device flag anyway, please let me know and I'll do that.
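For illustration only, here is a hypothetical sketch of the flag-based alternative under discussion; none of these Python names exist in the codebase, and it mainly shows the extra enum that would have to be exported:

```python
# Hypothetical flag-based alternative (NOT the implemented API): one entry
# point whose `kind` flag selects the memcpy direction, instead of the
# separate CudaSharedMemoryRegionSetDptr function this PR adds.
from enum import Enum

class CopyKind(Enum):
    HOST_TO_DEVICE = "htod"
    DEVICE_TO_DEVICE = "dtod"

def set_region(handle, src_ptr, byte_size, kind=CopyKind.HOST_TO_DEVICE):
    if kind is CopyKind.HOST_TO_DEVICE:
        pass  # would call CudaSharedMemoryRegionSet (HtoD memcpy)
    elif kind is CopyKind.DEVICE_TO_DEVICE:
        pass  # would call CudaSharedMemoryRegionSetDptr (DtoD memcpy)
    else:
        raise ValueError(f"unsupported copy kind: {kind}")
```

The `CopyKind` enum exported to Python is exactly the extra namespace surface the author wants to avoid.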
@rarzumanyan On second thought, your arguments for not contaminating the namespace with another enum are valid. You can just make the new changes I suggested for handling the `input_values` field specially for the case where it is a (CUDA) device pointer.
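A hypothetical sketch of that special-casing follows; the helper names are illustrative stand-ins for the underlying C calls, not code from this PR:

```python
# Hypothetical dispatch inside set_shared_memory_region: numpy arrays take
# the existing HtoD path, anything else is treated as CUDA device memory
# and routed to the DtoD path. The _region_set* helpers are stand-ins.
import numpy as np

def _region_set(handle, offset, arr):
    print(f"HtoD copy of {arr.nbytes} bytes at offset {offset}")

def _region_set_dptr(handle, offset, tensor):
    print(f"DtoD copy of {tensor.nbytes} bytes at offset {offset}")

def set_shared_memory_region(cuda_shm_handle, input_values):
    offset = 0
    for tensor in input_values:
        if isinstance(tensor, np.ndarray):
            _region_set(cuda_shm_handle, offset, tensor)       # host source
        else:
            _region_set_dptr(cuda_shm_handle, offset, tensor)  # device source
        offset += tensor.nbytes
```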
@tanmayv25 who would be a good person to task with reviewing the remainder of this code? I'm not familiar with this kind of code.
@jbkyang-nvi, @tanmayv25 has suggested you might be a good person to request a review from for this code. Can you help with that?
@rarzumanyan are you able to add testing for this feature?
Hello, I ran into the same problem. In one of my face recognition projects the data volume is large (batch > 64), and the recognition model depends on the results of the detection model, so I post-process the detection model's outputs (NMS, etc.) on the GPU with cupy/torch. But because the official `cudashm.set_shared_memory_region(shm_input_handle, [img])` only accepts a numpy array for `img`, I have to move the post-processed data back to the CPU. I had been doing that until I came across this PR. I'm a novice in C++ and don't know how to modify the code myself so that the Python tritonclient's `cudashm` supports uploading tensor data that is already on the GPU. Could you tell me how to build the Python module from your branch, so my Python program can send GPU tensors to Triton? Thanks a lot! @rarzumanyan
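For reference, both torch and cupy already expose the raw CUDA device pointer of a GPU-resident tensor, which is the piece a DtoD-capable client needs instead of a numpy array. How the patched `set_shared_memory_region` would consume these pointers is an assumption based on this thread:

```python
# Getting the raw CUDA device pointer from GPU tensors in torch and cupy.
# How the patched set_shared_memory_region consumes these pointers is an
# assumption based on this thread, not a documented API.
import torch
import cupy as cp

t = torch.zeros(64, 512, device="cuda")   # e.g. a batch of face embeddings
a = cp.zeros((64, 512), dtype=cp.float32)

torch_dptr = t.data_ptr()                 # integer address in vRAM
cupy_dptr = a.data.ptr                    # integer address in vRAM

print(hex(torch_dptr), hex(cupy_dptr))
# These device pointers, plus the byte sizes (t.numel() * t.element_size()
# and a.nbytes), are what a DtoD-capable set_shared_memory_region needs
# instead of a numpy array.
```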