tensorrtllm_backend
How does Triton implement batch inference?
In TensorRT-LLM's build.py, the maximum batch size is defined as:

```python
parser.add_argument('--max_batch_size', type=int, default=10)
```
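For context, a minimal sketch of how that option is parsed, with a comment on what it controls: the value only caps the batch size the compiled engine can accept, and does not by itself explain how the server forms batches at runtime.

```python
import argparse

# Minimal sketch, not the full build.py: --max_batch_size fixes the largest
# batch the compiled engine can accept, so any batching done at serving time
# can never exceed this value.
parser = argparse.ArgumentParser()
parser.add_argument('--max_batch_size', type=int, default=10)
args = parser.parse_args()
print(f"Engine will accept batches of up to {args.max_batch_size} requests")
```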
However, when requests are sent through client/inflight_batcher_llm_client.py, each gRPC request is sent, accepted, and answered individually. How does the server implement batching at the service level? My guess is that it blocks for a short time window, splices the pending requests into one batch, and then returns all of the responses at the same time. Is that right?
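To make that guess concrete, here is a minimal sketch of time-window batching: a single scheduler thread blocks for a short window, splices whatever requests have arrived into one batch, runs them together, and then unblocks all of the waiting callers. This is not the actual Triton / TensorRT-LLM backend code (the real in-flight batcher lives in the C++ backend); the class and parameter names below are hypothetical.

```python
import queue
import threading
import time

class TimeWindowBatcher:
    """Sketch of the guessed mechanism: block, splice requests, reply together."""

    def __init__(self, max_batch_size=10, window_s=0.005):
        self.max_batch_size = max_batch_size
        self.window_s = window_s
        self.pending = queue.Queue()

    def submit(self, prompt):
        """Called concurrently by many gRPC handler threads."""
        done = threading.Event()
        slot = {"prompt": prompt, "done": done, "result": None}
        self.pending.put(slot)
        done.wait()                       # block the caller until its batch finishes
        return slot["result"]

    def serve_forever(self, run_batch):
        """Single scheduler loop: gather requests for one window, run, reply."""
        while True:
            batch = [self.pending.get()]          # wait for the first request
            deadline = time.monotonic() + self.window_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.pending.get(timeout=remaining))
                except queue.Empty:
                    break
            outputs = run_batch([s["prompt"] for s in batch])  # one batched call
            for slot, out in zip(batch, outputs):
                slot["result"] = out
                slot["done"].set()                # unblock each waiting caller
```

Note that the client file name already points at in-flight (continuous) batching, which, as I understand it, can also add or remove requests between decoding iterations rather than only at fixed batch boundaries, so a fixed blocking window like the sketch above is at best an approximation of what the client observes.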