
Why is the GPU bbox parser slightly slower than the CPU bbox parser in V100 GPU tests?

Open ccqedq opened this issue 1 year ago • 4 comments

ccqedq avatar Aug 29 '23 14:08 ccqedq

Probably because the data needs to be copied from the CPU to the GPU and then from the GPU back to the CPU. It's not possible to get the data directly from the GPU in the current DeepStream version.

marcoslucianops avatar Aug 29 '23 19:08 marcoslucianops

From the code:

```cuda
thrust::device_vector<NvDsInferParseObjectInfo> objects(outputSize);

float minPreclusterThreshold = *(std::min_element(detectionParams.perClassPreclusterThreshold.begin(),
    detectionParams.perClassPreclusterThreshold.end()));
int threads_per_block = 1024;
int number_of_blocks = ((outputSize - 1) / threads_per_block) + 1;

decodeTensorYoloECuda<<<number_of_blocks, threads_per_block>>>(
    thrust::raw_pointer_cast(objects.data()), (float*) (boxes.buffer), (float*) (scores.buffer),
    (float*) (classes.buffer), outputSize, networkInfo.width, networkInfo.height,
    minPreclusterThreshold);

objectList.resize(outputSize);
thrust::copy(objects.begin(), objects.end(), objectList.begin());
```

It seems that in this implementation the data is only copied from the GPU to the CPU, and the `decodeTensorYoloECuda` kernel does not copy data from the CPU to the GPU.

ccqedq avatar Sep 01 '23 01:09 ccqedq

Raw pointer to access the data on the GPU: `thrust::raw_pointer_cast`
GPU to CPU: `thrust::copy`
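For reference, a minimal self-contained sketch of that same pattern (assuming a CUDA toolchain with Thrust; the `scale` kernel and the buffer size here are hypothetical stand-ins, not part of DeepStream-Yolo): allocate a `thrust::device_vector` on the GPU, hand a raw device pointer to a kernel, then copy the results back to the host.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

// Hypothetical kernel standing in for decodeTensorYoloECuda: it only
// reads and writes device memory, so no host<->device copy happens here.
__global__ void scale(float* data, int n, float factor) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    data[i] *= factor;
}

int main() {
  const int n = 1 << 20;
  // Allocated and initialized directly on the GPU; no CPU-to-GPU copy
  // of payload data is needed before the kernel runs.
  thrust::device_vector<float> d(n, 1.0f);

  int threads_per_block = 1024;
  int number_of_blocks = ((n - 1) / threads_per_block) + 1;
  scale<<<number_of_blocks, threads_per_block>>>(
      thrust::raw_pointer_cast(d.data()), n, 2.0f);

  // The only device-to-host transfer: results come back for the CPU side.
  thrust::host_vector<float> h(n);
  thrust::copy(d.begin(), d.end(), h.begin());
  return (h[0] == 2.0f && h[n - 1] == 2.0f) ? 0 : 1;
}
```

The same shape applies in the parser: `thrust::raw_pointer_cast` exposes the device buffer to the kernel, and `thrust::copy` at the end is the device-to-host step.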

marcoslucianops avatar Sep 01 '23 03:09 marcoslucianops

I said CPU to GPU to make the process easier to understand. But the best approach would be to do the full processing in a GPU batch, and that's not available in DeepStream.

marcoslucianops avatar Sep 01 '23 03:09 marcoslucianops