
How does Triton implement batch inference

Open · lyc728 opened this issue 5 months ago · 1 comment

In the TensorRT-LLM build.py there is: `parser.add_argument('--max_batch_size', type=int, default=10)`
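If it helps, this is roughly how I understand `--max_batch_size` entering the picture at engine-build time; it only fixes the maximum batch dimension compiled into the engine, not how requests get grouped at serving time. This is just a sketch, with placeholder paths and flags rather than my exact command:

```python
import subprocess

# Placeholder build invocation: only --max_batch_size matters for this question,
# the other flags and paths are illustrative stand-ins.
subprocess.run(
    [
        "python", "build.py",
        "--model_dir", "./hf_model",             # placeholder checkpoint path
        "--output_dir", "./engines/fp16/1-gpu",  # placeholder engine output path
        "--max_batch_size", "16",                # caps how many sequences one engine step can hold
    ],
    check=True,
)
```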

However, when requests are sent to Triton with `client/inflight_batcher_llm_client.py`, the client issues gRPC requests concurrently, and each one is accepted and returned on its own. How does the server implement batching at the service level? My guess is that it blocks for a short time window, splices the accumulated requests into one batch, and then returns all the results together — is that right?
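To make the question concrete, here is a rough, purely illustrative sketch of what I understand "in-flight batching" to mean, as opposed to my "block for a while, splice, return together" guess. None of this is the actual backend code; the names and the token-counting logic are made up for illustration:

```python
from collections import deque

MAX_BATCH_SIZE = 4  # plays the role of the engine's --max_batch_size

def decode_step(batch):
    """One forward pass advances every in-flight request by one token (placeholder)."""
    for req in batch:
        req["generated"] += 1

def run_inflight_batcher(requests):
    pending = deque(requests)  # requests arriving over gRPC, in arrival order
    active = []                # requests currently packed into the running batch
    while pending or active:
        # 1. Admit new requests as soon as slots free up, instead of blocking
        #    until a full batch has accumulated.
        while pending and len(active) < MAX_BATCH_SIZE:
            active.append(pending.popleft())
        # 2. Run one decoding iteration for the whole running batch.
        decode_step(active)
        # 3. Return each finished request immediately; the others stay in flight.
        for req in [r for r in active if r["generated"] >= r["max_new_tokens"]]:
            active.remove(req)
            print(f"request {req['id']} finished after {req['generated']} tokens")

# Tiny demo: six requests with different output lengths finish at different times,
# even though they were processed together.
run_inflight_batcher(
    [{"id": i, "generated": 0, "max_new_tokens": n}
     for i, n in enumerate([3, 5, 2, 8, 4, 6])]
)
```

Is this closer to what the tensorrt_llm backend actually does, i.e. requests join and leave the running batch per decoding step, rather than being gathered into a fixed batch up front and returned all at once?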

lyc728 · Feb 02 '24 08:02