DeepStream-Yolo
Why is the GPU bbox parser slightly slower than the CPU bbox parser in V100 GPU tests?
Probably because the data needs to be copied from the CPU to the GPU and then back from the GPU to the CPU. It's not possible to get the data directly from the GPU in the current DeepStream version.
From the code:

```cpp
thrust::device_vector<NvDsInferParseObjectInfo> objects(outputSize);

float minPreclusterThreshold = *(std::min_element(detectionParams.perClassPreclusterThreshold.begin(),
    detectionParams.perClassPreclusterThreshold.end()));

int threads_per_block = 1024;
int number_of_blocks = ((outputSize - 1) / threads_per_block) + 1;

decodeTensorYoloECuda<<<number_of_blocks, threads_per_block>>>(
    thrust::raw_pointer_cast(objects.data()), (float*) (boxes.buffer), (float*) (scores.buffer),
    (float*) (classes.buffer), outputSize, networkInfo.width, networkInfo.height, minPreclusterThreshold);

objectList.resize(outputSize);
thrust::copy(objects.begin(), objects.end(), objectList.begin());
```

It seems that in this implementation the data is only copied from the GPU to the CPU, and the `decodeTensorYoloECuda` function does not copy data from the CPU to the GPU.
Raw pointer for GPU access: `thrust::raw_pointer_cast`
GPU-to-CPU copy: `thrust::copy`
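The pattern above can be sketched in isolation: a `thrust::device_vector` allocates memory on the GPU, `thrust::raw_pointer_cast` exposes that memory to a plain CUDA kernel, and `thrust::copy` brings the results back to the host. The kernel `fillKernel` below is hypothetical, standing in for `decodeTensorYoloECuda`; only the Thrust calls mirror the parser code.

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <vector>

// Hypothetical stand-in for decodeTensorYoloECuda: one thread per element.
__global__ void fillKernel(int* out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = i * 2;
}

int main()
{
  const int n = 8;
  thrust::device_vector<int> d_vec(n);  // allocated directly on the GPU

  // raw_pointer_cast gives the kernel a plain device pointer into d_vec.
  fillKernel<<<1, 256>>>(thrust::raw_pointer_cast(d_vec.data()), n);
  cudaDeviceSynchronize();

  // thrust::copy performs the device-to-host transfer back into host memory.
  std::vector<int> h_vec(n);
  thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
  return 0;
}
```

Note that the only explicit host-device traffic here is the final `thrust::copy`; the device-to-host transfer at the end is the overhead being discussed in this thread.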
I said CPU to GPU to make the process easier to understand. The best approach would be to do the full processing in a GPU batch, but that's not available in DeepStream.