Low GPU utilization

ArgoHA opened this issue

Description I get around 25% GPU utilization (measured with the gpustats utility). That seems pretty low. I have one yolov5l model converted to TensorRT in Triton server, an A4000 GPU, and several cameras. I don't get the full 25 FPS, but GPU utilization never goes above 25%, and CPU usage stays below 50% on all cores. How can I get more performance, and how should I troubleshoot this?

Triton Information 21.09

Are you using the Triton container or did you build it yourself? Triton container

To Reproduce

https://github.com/huynhbaobk/tensorrt-triton-yolov5/tree/dev.gpu_a4000

Expected behavior Higher GPU utilization and better performance

ArgoHA avatar Aug 19 '22 19:08 ArgoHA

@ArgoHA It may depend on many things: for example, a small batch size at inference, or frames arriving too slowly from the stream. By the way, GPU utilization is not really a primary performance metric; it only shows how many GPU cores are busy. You should check latency and throughput metrics, depending on your task.

Let's imagine this task: we have 3 CCTV cameras, each producing 10 frames per second, and we want real-time object detection in this system.

So, to run in real time, each frame must be processed before the next one arrives, which gives a per-frame latency budget of 1 / (10 FPS) = 0.1 s, or 100 ms. And since we have 3 cameras, we must process 3 images simultaneously within that 100 ms budget, so 3 * 10 = 30 images/s is your minimal required throughput (see the sketch below).
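A minimal sketch of that budget arithmetic, assuming the camera count and frame rate from the example above:

```python
# Sketch: real-time budget for N cameras at F frames per second each.
cameras = 3          # number of CCTV cameras (from the example above)
fps_per_camera = 10  # frames per second produced by each camera

latency_budget_s = 1.0 / fps_per_camera    # 0.1 s = 100 ms per frame
min_throughput = cameras * fps_per_camera  # 30 images/s overall

print(f"per-frame latency budget: {latency_budget_s * 1000:.0f} ms")
print(f"minimal required throughput: {min_throughput} images/s")
```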

alxmamaev avatar Aug 20 '22 14:08 alxmamaev

@alxmamaev So do you mean I need to measure the time Triton takes to process one image? I just called time.perf_counter() before and after a prediction and computed FPS from that (roughly the sketch below). Shouldn't that show me throughput? You mentioned batch size: I use one image per batch because I want real-time object detection. If I waited for the next (second) frame from a camera, I would already be out of "real time". How can I use a bigger batch size and still stay real time? Maybe I should collect one frame from each camera in parallel into one batch and then send that to Triton?
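For reference, a minimal sketch of that measurement, with a stand-in for the blocking Triton call (infer_one_frame here is a placeholder, not a real client API):

```python
import time
import numpy as np

def infer_one_frame(frame):
    # Placeholder for a blocking Triton request (e.g. tritonclient's
    # client.infer); swap in the real call from your pipeline.
    time.sleep(0.01)
    return frame

frame = np.zeros((3, 640, 640), dtype=np.float32)  # one preprocessed image

t0 = time.perf_counter()
infer_one_frame(frame)
latency_s = time.perf_counter() - t0

# With a serial client, FPS computed this way is just 1 / latency: it
# measures single-request latency, not the server's maximum throughput.
print(f"latency: {latency_s * 1000:.1f} ms -> ~{1.0 / latency_s:.1f} FPS")
```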

ArgoHA avatar Aug 21 '22 06:08 ArgoHA

@ArgoHA You may concatenate images from different cameras into one batch before sending them to Triton, or just use Triton's dynamic batching feature (enabled with a dynamic_batching { } block in the model's config.pbtxt), which batches concurrent requests on the server side. A sketch of the client-side option is below.
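A minimal sketch of client-side batching, assuming a gRPC Triton endpoint at localhost:8001 and placeholder model/tensor names (yolov5, images, output0 are assumptions; take the real ones from your model's config.pbtxt):

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Assumed names; check your model's config.pbtxt for the real ones.
MODEL_NAME, INPUT_NAME, OUTPUT_NAME = "yolov5", "images", "output0"

client = grpcclient.InferenceServerClient(url="localhost:8001")

# One preprocessed frame per camera, stacked into a single batch of 3.
frames = [np.zeros((3, 640, 640), dtype=np.float32) for _ in range(3)]
batch = np.stack(frames)  # shape (3, 3, 640, 640)

infer_input = grpcclient.InferInput(INPUT_NAME, list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)

result = client.infer(
    model_name=MODEL_NAME,
    inputs=[infer_input],
    outputs=[grpcclient.InferRequestedOutput(OUTPUT_NAME)],
)
detections = result.as_numpy(OUTPUT_NAME)  # one set of detections per camera
```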

alxmamaev avatar Aug 21 '22 09:08 alxmamaev

We were having the same low GPU utilization problem. We exported YOLOv7 from PyTorch to TensorRT FP32 (batch-size=32) and created two setups with the same .engine (or .plan) file on different GPUs, one with Triton and one without. Unfortunately, we always observed low GPU utilization with Triton (never exceeding 40%) but almost 100% on the non-Triton setup. We also increased the number of model instances on the same GPU and never observed an increase in GPU utilization. We used a batch size of 32 on the client side and a concurrency level >= 4. FPS was always much higher on the non-Triton setup (Triton was about 2.5x slower).

alercelik avatar Sep 23 '22 13:09 alercelik

@alercelik I also get much faster non-Triton inference (using yolov5's detect.py).

Please let me know if you find anything out!

ArgoHA avatar Oct 02 '22 06:10 ArgoHA

During perf_client runs, GPU utilization is close to 100%, so there might be a bug, or a setting needed to enable full utilization during standard inference. We did not do any further research.
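For what it's worth, one likely difference is that perf_client keeps several requests in flight at once (its concurrency setting), while a simple synchronous loop leaves the GPU idle between requests. A minimal sketch of a concurrent client using tritonclient's async_infer, with the same assumed endpoint and names as the sketch above:

```python
import queue
import numpy as np
import tritonclient.grpc as grpcclient

MODEL_NAME, INPUT_NAME = "yolov5", "images"  # assumed names, as above
CONCURRENCY, REQUESTS = 4, 100

client = grpcclient.InferenceServerClient(url="localhost:8001")
completed = queue.Queue()

def make_input():
    batch = np.zeros((32, 3, 640, 640), dtype=np.float32)  # batch of 32
    inp = grpcclient.InferInput(INPUT_NAME, list(batch.shape), "FP32")
    inp.set_data_from_numpy(batch)
    return [inp]

def callback(result, error):
    completed.put(error if error else result)

# Keep CONCURRENCY requests in flight at all times, like perf_client does.
in_flight = 0
sent = 0
while sent < REQUESTS or in_flight > 0:
    while in_flight < CONCURRENCY and sent < REQUESTS:
        client.async_infer(MODEL_NAME, make_input(), callback)
        in_flight += 1
        sent += 1
    completed.get()  # block until one outstanding request finishes
    in_flight -= 1
```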

alercelik avatar Oct 05 '22 07:10 alercelik

Closing this issue due to lack of activity. Please re-open it if you would like to follow up.

jbkyang-nvi avatar Nov 22 '22 03:11 jbkyang-nvi