tensorrtllm_backend
Max requests capped at 228 while load testing with Locust
I deployed the Qwen2.5-1.5B-Instruct model on an A100 using tensorrtllm_backend with the following parameter values:
```
TRITON_MAX_BATCH_SIZE=1024
INSTANCE_COUNT=1
MAX_QUEUE_DELAY_MS=32
MAX_QUEUE_SIZE=0
DECOUPLED_MODE=true
LOGITS_DATATYPE=TYPE_FP32
```
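If it helps, my understanding is that these values map into the tensorrt_llm model's config.pbtxt roughly as follows (a sketch based on standard Triton field names, not a dump of my actual config):

```
max_batch_size: 1024                           # TRITON_MAX_BATCH_SIZE
model_transaction_policy { decoupled: true }   # DECOUPLED_MODE
dynamic_batching {
  max_queue_delay_microseconds: 32000          # MAX_QUEUE_DELAY_MS=32 (ms -> us)
}
instance_group [{ count: 1 }]                  # INSTANCE_COUNT
```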
The model deploys successfully, but I cannot get past 228 requests for any combination of user count and hatch rate over a 60-second run in Locust with constant_throughput(1). For example, with num users = 100 and hatch rate = 100, the timings are:
p50 = 26,000 ms | p90 = 26,000 ms | p99 = 27,000 ms | p100 = 27,000 ms | completed requests = 223
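For reproducibility, this is roughly the shape of my Locust test (the endpoint path, model name, and payload here are illustrative; the real script differs in details). Note that constant_throughput(1) caps each simulated user at one request per second, so with ~26 s response times 100 users can complete only about 100 × 60 / 26 ≈ 230 requests in a 60-second run:

```python
# Rough shape of the load test (illustrative endpoint and payload).
from locust import HttpUser, task, constant_throughput

class TritonUser(HttpUser):
    # Each simulated user issues at most one request per second.
    wait_time = constant_throughput(1)

    @task
    def generate(self):
        # Triton's HTTP generate endpoint for the ensemble model;
        # a decoupled deployment may need the streaming variant instead.
        self.client.post(
            "/v2/models/ensemble/generate",
            json={"text_input": "Hello, world", "max_tokens": 64},
        )
```

I run it along the lines of `locust -f locustfile.py --users 100 --spawn-rate 100 --run-time 60s --host http://localhost:8000` (spawn rate is what older Locust versions called hatch rate).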
What changes do I need to make?