Triton server dynamic_batching does not work with multiple concurrent requests
System Info
- GPU: A800 80GB × 2
- Container: nvcr.io/nvidia/tritonserver:24.11-trtllm-python-py3
- Model: Qwen2.5-14B-Instruct
Who can help?
No response
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
1. Add `dynamic_batching` to `tensorrt_llm/config.pbtxt`:

       dynamic_batching {
         preferred_batch_size: [ 32 ]
         max_queue_delay_microseconds: 10000
         default_queue_policy: { max_queue_size: 32 }
       }

2. Limit the model to a single instance on GPU 0:

       instance_group [
         {
           count: 1
           kind: KIND_GPU
           gpus: [ 0 ]
         }
       ]

3. Simulate 10 concurrent requests (see the client sketch below).
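For step 3, this is a minimal load-test sketch, assuming the default `ensemble` model exposed through Triton's HTTP `generate` endpoint on port 8000; the model name, URL, prompt, and token budget are assumptions, so adjust them to match your deployment:

```python
# Minimal concurrency sketch, assuming the default "ensemble" model and
# Triton's HTTP generate endpoint on localhost:8000 (both are assumptions).
import time
from concurrent.futures import ThreadPoolExecutor

import requests

MODEL_NAME = "ensemble"  # assumption: default tensorrtllm_backend ensemble
URL = f"http://localhost:8000/v2/models/{MODEL_NAME}/generate"

def send_request(i: int) -> float:
    """Send one generate request and return its wall-clock latency."""
    start = time.time()
    resp = requests.post(URL, json={
        "text_input": f"Request {i}: write a short story.",  # hypothetical prompt
        "max_tokens": 512,
        "bad_words": "",
        "stop_words": "",
    })
    resp.raise_for_status()
    return time.time() - start

start = time.time()
with ThreadPoolExecutor(max_workers=10) as pool:
    latencies = list(pool.map(send_request, range(10)))

print("per-request latencies:", [f"{t:.1f}s" for t in latencies])
print(f"total wall time: {time.time() - start:.1f}s")
# If dynamic batching worked, the total wall time should stay close to a
# single request's latency; instead it grows roughly linearly with the
# number of concurrent requests.
```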
Expected behavior
Expected all 10 requests to be batched and processed simultaneously, with results returned concurrently.
actual behavior
With the model limited to a single instance, the simulated concurrent requests are processed sequentially, one after another, instead of being batched. For example, if processing one request and generating its full response takes 10 seconds, the second request only begins after the first finishes, giving a total duration of about 20 seconds for two requests.
additional notes
If you need me to provide the complete config.pbtxt file, feel free to ask.