Triton server dynamic_batching does not work with multiple concurrent requests
System Info
- GPU: A800 80GB × 2
- Container: nvcr.io/nvidia/tritonserver:24.11-trtllm-python-py3
- Model: Qwen2.5-14B-Instruct
Who can help?
No response
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
1. Add `dynamic_batching` to `tensorrt_llm/config.pbtxt`:

       dynamic_batching {
         preferred_batch_size: [ 32 ]
         max_queue_delay_microseconds: 10000
         default_queue_policy: { max_queue_size: 32 }
       }

2. Limit the model to a single instance on GPU 0:

       instance_group [
         {
           count: 1
           kind: KIND_GPU
           gpus: [ 0 ]
         }
       ]

3. Simulate 10 concurrent requests (see the client sketch below).
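For step 3, this is a minimal load-test sketch, assuming the default `ensemble` model exposed through Triton's HTTP `generate` endpoint on port 8000; the model name, URL, prompt, and token budget are assumptions, so adjust them to match your deployment:

```python
# Minimal concurrency sketch, assuming the default "ensemble" model and
# Triton's HTTP generate endpoint on localhost:8000 (both are assumptions).
import time
from concurrent.futures import ThreadPoolExecutor

import requests

MODEL_NAME = "ensemble"  # assumption: default tensorrtllm_backend ensemble
URL = f"http://localhost:8000/v2/models/{MODEL_NAME}/generate"

def send_request(i: int) -> float:
    """Send one generate request and return its wall-clock latency."""
    start = time.time()
    resp = requests.post(URL, json={
        "text_input": f"Request {i}: write a short story.",  # hypothetical prompt
        "max_tokens": 512,
        "bad_words": "",
        "stop_words": "",
    })
    resp.raise_for_status()
    return time.time() - start

start = time.time()
with ThreadPoolExecutor(max_workers=10) as pool:
    latencies = list(pool.map(send_request, range(10)))

print("per-request latencies:", [f"{t:.1f}s" for t in latencies])
print(f"total wall time: {time.time() - start:.1f}s")
# If dynamic batching worked, the total wall time should stay close to a
# single request's latency; instead it grows roughly linearly with the
# number of concurrent requests.
```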
Expected behavior
Expected all 10 requests to be batched and processed simultaneously, with results returned concurrently.
actual behavior
With the model limited to a single instance, the simulated concurrent requests are processed sequentially, one after another, instead of being batched. For example, if processing one request and generating its full response takes 10 seconds, the second request only begins after the first finishes, giving a total duration of about 20 seconds for two requests.
additional notes
If you need me to provide the complete config.pbtxt file, feel free to ask.