
In-flight batching seems not to be working

Open · larme opened this issue 9 months ago · 3 comments

System Info

  • CPU: amd64
  • OS: Debian 12
  • GPU: NVIDIA RTX 4000 Ada
  • GPU driver: 535.161
  • TensorRT-LLM version: 0.8
  • tensorrtllm_backend version: 0.8

Who can help?

@kaiyux

Information

  • [x] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Followed exactly the steps in https://developer.nvidia.com/blog/turbocharging-meta-llama-3-performance-with-nvidia-tensorrt-llm-and-nvidia-triton-inference-server/. The only change is setting kv_cache_free_gpu_mem_fraction=0.95.
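
For completeness, a minimal sketch of that one change, assuming the model repo layout from the blog post (triton_model_repo/tensorrt_llm/config.pbtxt) and the tools/fill_template.py helper from this repo; adjust the paths to your setup:

```python
import subprocess

# Fill in the kv_cache_free_gpu_mem_fraction template variable in the
# tensorrt_llm model's config.pbtxt. The path is an assumption and should
# match your own model repository layout.
subprocess.run(
    [
        "python3", "tools/fill_template.py", "-i",
        "triton_model_repo/tensorrt_llm/config.pbtxt",
        "kv_cache_free_gpu_mem_fraction:0.95",
    ],
    check=True,
)
```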

Expected behavior

Then I ran two copies of https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/inflight_batcher_llm/client/end_to_end_grpc_client.py at the same time. The first script finishes within 30 seconds. I expect the second one to finish around the same time (about 30 seconds), as in the sketch below.
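
Roughly, the test looks like this (a sketch, assuming a Triton gRPC endpoint on localhost:8001; the client flags and prompt are placeholders and should be adapted to the script's actual arguments):

```python
import subprocess
import threading
import time

# Assumed client invocation; adjust URL, prompt, and output length
# to match your deployment and the client's actual flags.
CMD = [
    "python3", "inflight_batcher_llm/client/end_to_end_grpc_client.py",
    "-u", "localhost:8001",
    "-p", "Tell me a long story about batching.",
    "-o", "512",
]

def run_client(tag):
    start = time.time()
    subprocess.run(CMD, check=True)
    print(f"client {tag} finished in {time.time() - start:.1f}s")

# Start both clients at (nearly) the same time. With in-flight batching
# enabled, both should finish in roughly the time of a single run.
threads = [threading.Thread(target=run_client, args=(i,)) for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```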

Actual behavior

The second one only finishes after about 60 seconds. So it seems batching is not working, and every request blocks the requests that come after it.
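
One configuration detail worth ruling out: in-flight batching is only active when the gpt_model_type parameter in the tensorrt_llm model's config.pbtxt is set to inflight_batching or inflight_fused_batching; with V1 the backend falls back to static batching. A quick check (a sketch; the config path is an assumption):

```python
import re
from pathlib import Path

# Point this at your deployed tensorrt_llm model config.
cfg = Path("triton_model_repo/tensorrt_llm/config.pbtxt").read_text()

# Extract the gpt_model_type parameter from the protobuf text format.
m = re.search(r'key:\s*"gpt_model_type".*?string_value:\s*"([^"]+)"', cfg, re.S)
print("gpt_model_type =", m.group(1) if m else "<not set>")
```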

Additional notes

Very similar to #189, but that user reported their issue was fixed after 0.6.1.

larme · May 06 '24 21:05