
Triton Inference Server Stops Processing Requests under High Traffic, GPU Utilization Stuck at 100%

Open MrD005 opened this issue 1 year ago • 3 comments

Bug Description: Under high traffic, the Triton Inference Server appears to freeze and stops processing incoming requests. While it is in this state, GPU utilization reaches 100% and stays pinned there, but no further requests are processed.

This issue leads to a bottleneck where the server no longer responds to requests until it is restarted or traffic decreases significantly.

Steps to Reproduce:

- Deploy Triton Inference Server in a GPU-based environment.
- Send a high volume of concurrent inference requests (e.g., thousands of requests per second); see the load-generation sketch below.
- Monitor GPU utilization and request processing.

Observed Behavior:

- GPU utilization spikes to 100% and remains stuck at that level.
- No new requests are processed after the spike.
- The server becomes unresponsive.

Expected Behavior:

- Triton Inference Server should continue processing requests and manage the traffic load without freezing.
- GPU utilization should fluctuate with the load rather than end in a total freeze.

Environment:

- GPU model: 2x H100
- CUDA version: 11.8
- TensorRT-LLM version: 0.10.0.dev2024043000
- OS: Ubuntu 22
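For reference, here is a minimal load-generation sketch along the lines of the steps above. It is not the exact setup used here: the endpoint (Triton's generate extension on the default HTTP port 8000), the model name ("ensemble"), the payload fields ("text_input"/"max_tokens"), and the concurrency numbers are all assumptions to adjust for your deployment. If the hang reproduces, the req/s figure drops to zero while nvidia-smi keeps reporting 100% utilization.

```python
"""
Rough load generator for the reproduction steps above. Everything below is an
assumption to adapt: the endpoint targets Triton's generate extension on the
default HTTP port 8000 with a TensorRT-LLM "ensemble" model, and the payload
uses "text_input"/"max_tokens" fields.
"""
import concurrent.futures
import subprocess
import time

import requests

TRITON_URL = "http://localhost:8000/v2/models/ensemble/generate"  # assumed model/endpoint
PAYLOAD = {"text_input": "Hello, world", "max_tokens": 64}         # assumed input names
CONCURRENCY = 256          # in-flight requests
TOTAL_REQUESTS = 20_000    # total requests to fire
TIMEOUT_S = 120            # anything slower than this is treated as a stall


def one_request(_: int) -> bool:
    """Send one inference request; True means HTTP 200 within the timeout."""
    try:
        return requests.post(TRITON_URL, json=PAYLOAD, timeout=TIMEOUT_S).status_code == 200
    except requests.RequestException:
        return False


def gpu_utilization() -> str:
    """Read current GPU utilization via nvidia-smi (one value per GPU)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return out.stdout.strip().replace("\n", " / ")


def main() -> None:
    ok = fail = 0
    start = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        futures = [pool.submit(one_request, i) for i in range(TOTAL_REQUESTS)]
        for n, fut in enumerate(concurrent.futures.as_completed(futures), 1):
            if fut.result():
                ok += 1
            else:
                fail += 1
            if n % 500 == 0:
                rate = n / (time.time() - start)
                print(f"{n} done | ok={ok} fail={fail} | {rate:.1f} req/s | GPU util: {gpu_utilization()}")
    print(f"finished: ok={ok} fail={fail}")


if __name__ == "__main__":
    main()
```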

MrD005 · Aug 19 '24 05:08

Same problem on 2x L20.

hcnhcn012 · Sep 03 '24 09:09

Can confirm this is happening. I'm not entirely sure this is due to high load or if there is a poisoned request that makes it crash, but I have managed to reproduce this by merely bombarding the server with requests.

After a point, the server stalls and refuses to accept new requests, even after all the other requests have been fulfilled.
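To distinguish a genuine hang from a slow drain of the queue, a simple probe against Triton's standard readiness endpoint can log the moment the server stops answering even after the load generator has been stopped. The URL (default HTTP port 8000) is an assumption; adjust as needed.

```python
"""
Readiness probe: keep polling Triton after the load stops and log whether it
still answers. The port is an assumption (default HTTP port 8000).
"""
import time

import requests

READY_URL = "http://localhost:8000/v2/health/ready"  # standard Triton readiness endpoint

while True:
    try:
        status = requests.get(READY_URL, timeout=5).status_code
        print(time.strftime("%H:%M:%S"), "ready" if status == 200 else f"HTTP {status}")
    except requests.RequestException as exc:
        # Timeouts/connection errors here, while the GPUs sit at 100% utilization,
        # match the hang described in this issue.
        print(time.strftime("%H:%M:%S"), "no response:", type(exc).__name__)
    time.sleep(10)
```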

gabriel-peracio · Sep 20 '24 17:09

@gabriel-peracio @hcnhcn012 @MrD005 Were you able to find a fix for this?

jayakommuru · Jun 28 '25 18:06