tensorrtllm_backend
In-flight batching seems not to be working
System Info
- CPU: amd64
- OS: Debian 12
- GPU: NVIDIA RTX 4000 Ada
- GPU driver: 535.161
- TensorRT-LLM version: 0.8
- tensorrtllm_backend version: 0.8
Who can help?
@kaiyux
Information
- [x] The official example scripts
- [ ] My own modified scripts
Tasks
- [x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I followed the steps in https://developer.nvidia.com/blog/turbocharging-meta-llama-3-performance-with-nvidia-tensorrt-llm-and-nvidia-triton-inference-server/ exactly. The only change was setting kv_cache_free_gpu_mem_fraction=0.95.
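For reference, here is a quick sanity check of the deployed configuration (not part of the original steps; the model-repo path is an assumption based on the blog post, and the parameter names follow the 0.8 tensorrt_llm config.pbtxt template):

```python
# Hedged sanity check: scan the tensorrt_llm model's config.pbtxt for the
# parameters that control batching. Path and parameter names are assumptions
# based on the blog post's model-repo layout; adjust if yours differs.
import re
import sys

config_path = sys.argv[1] if len(sys.argv) > 1 else "triton_model_repo/tensorrt_llm/config.pbtxt"
keys = ("gpt_model_type", "batch_scheduler_policy", "kv_cache_free_gpu_mem_fraction")

text = open(config_path).read()
for key in keys:
    # parameters are stored as:  key: "<name>" ... string_value: "<value>"
    m = re.search(r'key:\s*"' + re.escape(key) + r'".*?string_value:\s*"([^"]*)"', text, re.S)
    print(f'{key} = {m.group(1) if m else "<not found>"}')
```

If I read the backend docs correctly, in-flight batching is only active when gpt_model_type is inflight_batching or inflight_fused_batching; with V1 requests are executed as static batches.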
Expected behavior
Then I run two copies of https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/inflight_batcher_llm/client/end_to_end_grpc_client.py at the same time (see the sketch below). The first script finishes within 30 seconds. I expect the second one to finish around the same time (about 30 seconds).
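This is roughly how I drive the two clients. A minimal sketch: the client path comes from this repo, but the prompt/output-length flag names are placeholders, so check the script's --help for the exact arguments.

```python
# Minimal sketch of the concurrency test. The --prompt/--output-len flags and
# their values are placeholders; substitute the client's real arguments.
import subprocess
import time

cmd = [
    "python3",
    "inflight_batcher_llm/client/end_to_end_grpc_client.py",
    "--prompt", "Summarize the plot of Hamlet.",   # placeholder flag/prompt
    "--output-len", "512",                         # placeholder flag/value
]

start = time.time()
procs = [subprocess.Popen(cmd) for _ in range(2)]  # launch both clients at once
for i, proc in enumerate(procs):
    proc.wait()
    print(f"client {i} finished after {time.time() - start:.1f}s")
```

With in-flight batching, both clients should report roughly the same elapsed time.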
actual behavior
The second one only finishes after about 60 seconds. So it seems that batching is not working and every request blocks the requests that come after it.
additional notes
Very similar to #189, but that user reported their issue was fixed after 0.6.1.