tensorrtllm_backend

Triton server: dynamic_batching does not work for multiple concurrent requests

Open · kazyun opened this issue on Dec 13, 2024 · 1 comment

System Info

  • GPU: 2× A800 80GB
  • Container: nvcr.io/nvidia/tritonserver:24.11-trtllm-python-py3
  • Model: Qwen2.5-14B-Instruct

Who can help?

No response

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

  1. Add `dynamic_batching` to `tensorrt_llm/config.pbtxt`:

     ```
     dynamic_batching {
       preferred_batch_size: [ 32 ]
       max_queue_delay_microseconds: 10000
       default_queue_policy: { max_queue_size: 32 }
     }
     ```

  2. Limit the model to a single GPU instance:

     ```
     instance_group [
       {
         count: 1
         kind: KIND_GPU
         gpus: [ 0 ]
       }
     ]
     ```

  3. Simulate 10 concurrent requests (a client sketch follows this list).
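A minimal sketch of such a concurrent client, assuming the common TensorRT-LLM ensemble layout: model name `ensemble`, inputs `text_input` / `max_tokens`, server at `localhost:8000`. These names are assumptions, not taken from the reporter's setup; adjust them to match your own config.pbtxt:

```python
# Sketch: fire N concurrent requests at Triton and time each one.
# Assumed names (adjust to your deployment): model "ensemble",
# inputs "text_input" (BYTES) and "max_tokens" (INT32).
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import tritonclient.http as httpclient

URL = "localhost:8000"
MODEL = "ensemble"   # hypothetical model name
CONCURRENCY = 10

def one_request(i):
    client = httpclient.InferenceServerClient(url=URL)
    text = httpclient.InferInput("text_input", [1, 1], "BYTES")
    text.set_data_from_numpy(np.array([[f"prompt {i}".encode()]], dtype=object))
    max_tokens = httpclient.InferInput("max_tokens", [1, 1], "INT32")
    max_tokens.set_data_from_numpy(np.array([[128]], dtype=np.int32))
    start = time.time()
    client.infer(MODEL, inputs=[text, max_tokens])
    return i, time.time() - start

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    for i, latency in pool.map(one_request, range(CONCURRENCY)):
        print(f"request {i}: {latency:.1f}s")
print(f"total wall time: {time.time() - start:.1f}s")
```

If requests are being batched, the total wall time should be close to a single request's latency; if they run sequentially, it grows roughly linearly with the request count.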

Expected behavior

All 10 requests should be batched and processed simultaneously, returning their results in roughly the time of a single request.

Actual behavior

With the model limited to a single instance, concurrent requests are processed strictly sequentially, one after another. For example, if generating the full response for one request takes 10 seconds, the second request only begins after the first finishes, so two requests take 20 seconds in total.

Additional notes

If you need me to provide the complete config.pbtxt file, feel free to ask.
