tensorrtllm_backend
In-flight batching seems not to be working
System Info
- CPU: amd64
- OS: Debian 12
- GPU: NVIDIA RTX 4000 Ada
- GPU driver: 535.161
- TensorRT-LLM version: 0.8
- tensorrtllm_backend version: 0.8
Who can help?
@kaiyux
Information
- [x] The official example scripts
- [ ] My own modified scripts
Tasks
- [x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I followed the steps in https://developer.nvidia.com/blog/turbocharging-meta-llama-3-performance-with-nvidia-tensorrt-llm-and-nvidia-triton-inference-server/ exactly. The only change was setting kv_cache_free_gpu_mem_fraction=0.95.
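For reference, here is a quick sanity check of the deployed configuration (not part of the original steps; the model-repo path is an assumption based on the blog post, and the parameter names follow the 0.8 tensorrt_llm config.pbtxt template):

```python
# Hedged sanity check: scan the tensorrt_llm model's config.pbtxt for the
# parameters that control batching. Path and parameter names are assumptions
# based on the blog post's model-repo layout; adjust if yours differs.
import re
import sys

config_path = sys.argv[1] if len(sys.argv) > 1 else "triton_model_repo/tensorrt_llm/config.pbtxt"
keys = ("gpt_model_type", "batch_scheduler_policy", "kv_cache_free_gpu_mem_fraction")

text = open(config_path).read()
for key in keys:
    # parameters are stored as:  key: "<name>" ... string_value: "<value>"
    m = re.search(r'key:\s*"' + re.escape(key) + r'".*?string_value:\s*"([^"]*)"', text, re.S)
    print(f'{key} = {m.group(1) if m else "<not found>"}')
```

If I read the backend docs correctly, in-flight batching is only active when gpt_model_type is inflight_batching or inflight_fused_batching; with V1 requests are executed as static batches.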
Expected behavior
Then I run two copies of https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/inflight_batcher_llm/client/end_to_end_grpc_client.py at the same time (see the sketch below). The first script finishes within 30 seconds. I expect the second one to finish around the same time (about 30 seconds).
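This is roughly how I drive the two clients. A minimal sketch: the client path comes from this repo, but the prompt/output-length flag names are placeholders, so check the script's --help for the exact arguments.

```python
# Minimal sketch of the concurrency test. The --prompt/--output-len flags and
# their values are placeholders; substitute the client's real arguments.
import subprocess
import time

cmd = [
    "python3",
    "inflight_batcher_llm/client/end_to_end_grpc_client.py",
    "--prompt", "Summarize the plot of Hamlet.",   # placeholder flag/prompt
    "--output-len", "512",                         # placeholder flag/value
]

start = time.time()
procs = [subprocess.Popen(cmd) for _ in range(2)]  # launch both clients at once
for i, proc in enumerate(procs):
    proc.wait()
    print(f"client {i} finished after {time.time() - start:.1f}s")
```

With in-flight batching, both clients should report roughly the same elapsed time.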
actual behavior
The second one only finishes after about 60 seconds. So it seems that batching is not working and every request blocks the requests that come after it.
additional notes
Very similar to #189, but that user reported their issue was fixed after 0.6.1.