DeepStream-Yolo
Why is the GPU bbox parser slightly slower than the CPU bbox parser in V100 GPU tests?
Probably because the data needs to be copied from the CPU to the GPU and then back from the GPU to the CPU. It's not possible to get the data directly from the GPU in the current DeepStream version.
From the code:

```cpp
thrust::device_vector<NvDsInferParseObjectInfo> objects(outputSize);

float minPreclusterThreshold = *(std::min_element(detectionParams.perClassPreclusterThreshold.begin(),
    detectionParams.perClassPreclusterThreshold.end()));

int threads_per_block = 1024;
int number_of_blocks = ((outputSize - 1) / threads_per_block) + 1;

decodeTensorYoloECuda<<<number_of_blocks, threads_per_block>>>(
    thrust::raw_pointer_cast(objects.data()), (float*) (boxes.buffer), (float*) (scores.buffer),
    (float*) (classes.buffer), outputSize, networkInfo.width, networkInfo.height, minPreclusterThreshold);

objectList.resize(outputSize);
thrust::copy(objects.begin(), objects.end(), objectList.begin());
```

It seems that in this implementation the data is only copied from the GPU to the CPU, and the `decodeTensorYoloECuda` function does not copy data from the CPU to the GPU.
Raw pointer for GPU access: `thrust::raw_pointer_cast`
GPU-to-CPU copy: `thrust::copy`
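The pattern above can be sketched in isolation: a `thrust::device_vector` allocates memory on the GPU, `thrust::raw_pointer_cast` exposes that memory to a plain CUDA kernel, and `thrust::copy` brings the results back to the host. The kernel `fillKernel` below is hypothetical, standing in for `decodeTensorYoloECuda`; only the Thrust calls mirror the parser code.

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <vector>

// Hypothetical stand-in for decodeTensorYoloECuda: one thread per element.
__global__ void fillKernel(int* out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = i * 2;
}

int main()
{
  const int n = 8;
  thrust::device_vector<int> d_vec(n);  // allocated directly on the GPU

  // raw_pointer_cast gives the kernel a plain device pointer into d_vec.
  fillKernel<<<1, 256>>>(thrust::raw_pointer_cast(d_vec.data()), n);
  cudaDeviceSynchronize();

  // thrust::copy performs the device-to-host transfer back into host memory.
  std::vector<int> h_vec(n);
  thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
  return 0;
}
```

Note that the only explicit host-device traffic here is the final `thrust::copy`; the device-to-host transfer at the end is the overhead being discussed in this thread.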
I said CPU to GPU to make the process easier to understand. The best approach would be to do the full processing in a GPU batch, but that's not available in DeepStream.