tensorrtllm_backend
How does Triton implement batch inference?
In TensorRT-LLM's build.py, the maximum batch size is defined as:

```python
parser.add_argument('--max_batch_size', type=int, default=10)
```
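For context, a minimal sketch of how that option is parsed, with a comment on what it controls: the value only caps the batch size the compiled engine can accept, and does not by itself explain how the server forms batches at runtime.

```python
import argparse

# Minimal sketch, not the full build.py: --max_batch_size fixes the largest
# batch the compiled engine can accept, so any batching done at serving time
# can never exceed this value.
parser = argparse.ArgumentParser()
parser.add_argument('--max_batch_size', type=int, default=10)
args = parser.parse_args()
print(f"Engine will accept batches of up to {args.max_batch_size} requests")
```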
However, when requests are sent through client/inflight_batcher_llm_client.py, each gRPC request is sent, accepted, and answered individually. How does the server implement batching at the service level? My guess is that it blocks for a short time window, splices the pending requests into one batch, and then returns all of the responses at the same time. Is that right?
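To make that guess concrete, here is a minimal sketch of time-window batching: a single scheduler thread blocks for a short window, splices whatever requests have arrived into one batch, runs them together, and then unblocks all of the waiting callers. This is not the actual Triton / TensorRT-LLM backend code (the real in-flight batcher lives in the C++ backend); the class and parameter names below are hypothetical.

```python
import queue
import threading
import time

class TimeWindowBatcher:
    """Sketch of the guessed mechanism: block, splice requests, reply together."""

    def __init__(self, max_batch_size=10, window_s=0.005):
        self.max_batch_size = max_batch_size
        self.window_s = window_s
        self.pending = queue.Queue()

    def submit(self, prompt):
        """Called concurrently by many gRPC handler threads."""
        done = threading.Event()
        slot = {"prompt": prompt, "done": done, "result": None}
        self.pending.put(slot)
        done.wait()                       # block the caller until its batch finishes
        return slot["result"]

    def serve_forever(self, run_batch):
        """Single scheduler loop: gather requests for one window, run, reply."""
        while True:
            batch = [self.pending.get()]          # wait for the first request
            deadline = time.monotonic() + self.window_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.pending.get(timeout=remaining))
                except queue.Empty:
                    break
            outputs = run_batch([s["prompt"] for s in batch])  # one batched call
            for slot, out in zip(batch, outputs):
                slot["result"] = out
                slot["done"].set()                # unblock each waiting caller
```

Note that the client file name already points at in-flight (continuous) batching, which, as I understand it, can also add or remove requests between decoding iterations rather than only at fixed batch boundaries, so a fixed blocking window like the sketch above is at best an approximation of what the client observes.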