DeepStream-Yolo

Under-utilized GPU

Open zetxy opened this issue 3 years ago • 3 comments

Hello,

I have been testing the new updates to this repository and encountered some performance-related problems. Here are the specifications of the environment I used for the experiments.

OS: Ubuntu 20.04.4 LTS
DeepStream: 6.0.1
TensorRT: 8.0.1
GPU: GTX 1060 and A10
CPU: Intel i5 (6 cores) and AMD EPYC (8 cores)
Model: YOLOv4-tiny, 416×416, INT8 quantization (can provide more information if needed)

I have created the engine as instructed in the project README.
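For context, the engine was generated through a config_infer file roughly along these lines (a minimal sketch; the .cfg/.weights names, calibration table, and class count are placeholders for my setup, not the repo's exact sample file):

[property]
gpu-id=0
net-scale-factor=0.0039215697906911373
# Darknet model definition and weights (placeholder file names)
custom-network-config=yolov4-tiny.cfg
model-file=yolov4-tiny.weights
model-engine-file=model_b1_gpu0_int8.engine
# network-mode=1 selects INT8; the calibration table name is a placeholder
int8-calib-file=calib.table
network-mode=1
batch-size=1
num-detected-classes=80
parse-bbox-func-name=NvDsInferParseYolo
custom-lib-path=nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so
engine-create-func-name=NvDsInferYoloCudaEngineGet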

Then using trtexec, I measured the performance of the exported engine.

/usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_int8.engine --plugins=./libnvdsinfer_custom_impl_Yolo.so
[07/12/2022-14:36:38] [I] === Performance summary ===
[07/12/2022-14:36:38] [I] Throughput: 589.481 qps
[07/12/2022-14:36:38] [I] Latency: min = 1.77734 ms, max = 9.30664 ms, mean = 1.87308 ms, median = 1.85028 ms, percentile(99%) = 2.00154 ms
[07/12/2022-14:36:38] [I] End-to-End Host Latency: min = 1.79565 ms, max = 9.31885 ms, mean = 1.89146 ms, median = 1.86713 ms, percentile(99%) = 2.02057 ms
[07/12/2022-14:36:38] [I] Enqueue Time: min = 1.62524 ms, max = 9.1012 ms, mean = 1.67503 ms, median = 1.65308 ms, percentile(99%) = 1.793 ms
[07/12/2022-14:36:38] [I] H2D Latency: min = 0.168457 ms, max = 0.226562 ms, mean = 0.173509 ms, median = 0.171875 ms, percentile(99%) = 0.199219 ms
[07/12/2022-14:36:38] [I] GPU Compute Time: min = 1.60059 ms, max = 9.13 ms, mean = 1.69459 ms, median = 1.67212 ms, percentile(99%) = 1.81248 ms
[07/12/2022-14:36:38] [I] D2H Latency: min = 0.00439453 ms, max = 0.0213013 ms, mean = 0.00497595 ms, median = 0.00488281 ms, percentile(99%) = 0.0057373 ms
[07/12/2022-14:36:38] [I] Total Host Walltime: 3.00264 s
[07/12/2022-14:36:38] [I] Total GPU Compute Time: 2.99943 s
[07/12/2022-14:36:38] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[07/12/2022-14:36:38] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
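(For completeness, the rerun that the warning suggests would simply append the flag trtexec itself recommends; I am not quoting its numbers here.)

/usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_int8.engine --plugins=./libnvdsinfer_custom_impl_Yolo.so --useCudaGraph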

However, as stated in the log, the GPU was under-utilized: according to nvidia-smi dmon, the SM load peaked at ~70% (or ~50% on the A10).
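For reference, the utilization was sampled with nvidia-smi dmon along these lines (the -s u selector, which restricts the output to the utilization counters, is an assumption about the exact invocation):

# print SM/memory/encoder/decoder utilization once per second while trtexec runs
nvidia-smi dmon -s u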

Is this the expected behaviour (i.e. GPU utilization nowhere near 100%), or is there some oversight on my part?

zetxy avatar Jul 12 '22 12:07 zetxy

DeepStream isn't fully GPU-based, so it can have bottlenecks on the CPU, RAM, or other components. I'm working on improvements.

marcoslucianops avatar Jul 13 '22 17:07 marcoslucianops

Hello,

As I mentioned, I have not been using anything DeepStream-related for the benchmark, given that it is a complex system with many possible places for a bottleneck.

Therefore, in an effort to isolate the problem, I have been using trtexec (the exact command and arguments are included in my original post). Still, the GPU was under-utilized during the sampling and averaged ~590 qps.

Out of curiosity, I repeated the same procedure with the commit from 12 Dec 2021, and the load peaked at 99% with around 900 qps.
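For reproducibility, the older revision was tested roughly like this (a sketch; the default branch name, the exact commit, and the CUDA version are assumptions about my setup):

# check out the last commit on or before 12 Dec 2021 (assumes the default branch is master)
git checkout $(git rev-list -1 --before="2021-12-13" master)
# rebuild the custom plugin; CUDA_VER must match the installed toolkit (e.g. 11.4 for DeepStream 6.0.1)
CUDA_VER=11.4 make -C nvdsinfer_custom_impl_Yolo
# then regenerate the engine and profile it with the same trtexec command as above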

Thus, to reformulate my earlier question:

Is this the expected performance (together with this GPU utilization) when profiling with trtexec, or is there possibly something misconfigured on my side? Please note that I did not use anything DeepStream-related as part of the benchmarking.

zetxy avatar Jul 14 '22 15:07 zetxy

I don't know about the trtexec speed, and I don't have an A10 available for testing.

Testing a V100 (p3.2xlarge AWS instance) with DeepStream:

Model: YOLOv4-Tiny
Size: 416
Mode: FP16
Batch-size: 1
FPS: 634.40
GPU usage: 69%
CPU usage: 13%
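For anyone reproducing this, deepstream-app prints the per-stream FPS when perf measurement is enabled in the app config, e.g. (the interval value is arbitrary):

[application]
enable-perf-measurement=1
perf-measurement-interval-sec=5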

marcoslucianops avatar Jul 15 '22 03:07 marcoslucianops