tensorrtllm_backend
Max requests capped at 228 while load testing with Locust
I deployed the Qwen2.5-1.5B-Instruct model on an A100 using tensorrtllm_backend with the following parameter values:
```
TRITON_MAX_BATCH_SIZE=1024
INSTANCE_COUNT=1
MAX_QUEUE_DELAY_MS=32
MAX_QUEUE_SIZE=0
DECOUPLED_MODE=true
LOGITS_DATATYPE=TYPE_FP32
```
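If it helps, my understanding is that these values map into the tensorrt_llm model's config.pbtxt roughly as follows (a sketch based on standard Triton field names, not a dump of my actual config):

```
max_batch_size: 1024                           # TRITON_MAX_BATCH_SIZE
model_transaction_policy { decoupled: true }   # DECOUPLED_MODE
dynamic_batching {
  max_queue_delay_microseconds: 32000          # MAX_QUEUE_DELAY_MS=32 (ms -> us)
}
instance_group [{ count: 1 }]                  # INSTANCE_COUNT
```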
The model deploys successfully, but I cannot get past 228 requests for any combination of user count and hatch rate over a 60-second run in Locust with constant_throughput(1). For example, with num users = 100 and hatch rate = 100, the timings are:
p50 = 26,000 ms | p90 = 26,000 ms | p99 = 27,000 ms | p100 = 27,000 ms | completed requests = 223
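For reproducibility, this is roughly the shape of my Locust test (the endpoint path, model name, and payload here are illustrative; the real script differs in details). Note that constant_throughput(1) caps each simulated user at one request per second, so with ~26 s response times 100 users can complete only about 100 × 60 / 26 ≈ 230 requests in a 60-second run:

```python
# Rough shape of the load test (illustrative endpoint and payload).
from locust import HttpUser, task, constant_throughput

class TritonUser(HttpUser):
    # Each simulated user issues at most one request per second.
    wait_time = constant_throughput(1)

    @task
    def generate(self):
        # Triton's HTTP generate endpoint for the ensemble model;
        # a decoupled deployment may need the streaming variant instead.
        self.client.post(
            "/v2/models/ensemble/generate",
            json={"text_input": "Hello, world", "max_tokens": 64},
        )
```

I run it along the lines of `locust -f locustfile.py --users 100 --spawn-rate 100 --run-time 60s --host http://localhost:8000` (spawn rate is what older Locust versions called hatch rate).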
What changes do I need to make?