
Python backend: Decoupled=True models with dynamic batching cannot be batched

Open protonicage opened this issue 3 months ago • 0 comments

Description: I want to use batching or dynamic batching with a decoupled Python model. However, the usual approach of iterating over the requests in execute() and appending each input tensor to a list does not work. The reason is simple: to talk to a decoupled model, the client is forced to use triton_client.stream_infer() to send requests. That call effectively disables batching, creating exactly one request for every stream, so when I send 4 files they are either processed one after another (no batching) or in parallel on separate streams (also no batching).
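To make the pattern concrete, this is roughly what I mean; a minimal sketch of a decoupled model.py, where the input/output names "AUDIO"/"TEXT" and the dummy string output are placeholders, not the real whisper BLS interface:

```python
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        # Usual dynamic-batching pattern: stack the per-request input tensors.
        audio = [
            pb_utils.get_input_tensor_by_name(r, "AUDIO").as_numpy()
            for r in requests
        ]
        batch = np.concatenate(audio, axis=0)
        # What I observe with stream_infer(): len(requests) is always 1,
        # so batch.shape[0] is always 1 and nothing is ever batched.
        # `batch` would be fed to the actual backend here.

        for request in requests:
            sender = request.get_response_sender()
            # Placeholder output; the real model would return transcriptions.
            out = pb_utils.Tensor("TEXT", np.array([b"..."], dtype=np.object_))
            sender.send(pb_utils.InferenceResponse(output_tensors=[out]))
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        return None
```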

So is this behaviour intended? I could implement a custom queue and fill it inside the request loop, but then I have to manage memory myself for scenarios like 10k requests that are all flushed into the queue at once; a rough sketch of that workaround is below.
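For reference, the workaround I have in mind would look roughly like this. It is only a sketch (tensor names, batch size and timeout are made up) and it ignores exactly the memory problem described above:

```python
import queue
import threading

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        self._queue = queue.Queue()
        threading.Thread(target=self._batch_loop, daemon=True).start()

    def execute(self, requests):
        # Decoupled mode: only enqueue; the worker thread forms the batches.
        for request in requests:
            audio = pb_utils.get_input_tensor_by_name(request, "AUDIO").as_numpy()
            self._queue.put((audio, request.get_response_sender()))
        return None

    def _batch_loop(self, max_batch=8, max_wait_s=0.01):
        while True:
            items = [self._queue.get()]  # block until something arrives
            try:
                while len(items) < max_batch:
                    items.append(self._queue.get(timeout=max_wait_s))
            except queue.Empty:
                pass
            batch = np.concatenate([audio for audio, _ in items], axis=0)
            # ... run the actual batched inference on `batch` here ...
            for _, sender in items:
                out = pb_utils.Tensor("TEXT", np.array([b"..."], dtype=np.object_))
                sender.send(pb_utils.InferenceResponse(output_tensors=[out]))
                sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
```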

I verified this behaviour: I usually stack the tensors after the request loop, and they all had batch_size 1, so every file was processed on its own.

Why not use default mode? Because I have a tensorrt-llm model as a downstream task, which is decoupled to make use of in-flight batching. Using default mode for the "wrapper" Python model would mean that I have to use gRPC inside the server to manage the stream, at least from my limited understanding.
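For completeness, the wrapper reaches the decoupled tensorrt-llm model through BLS roughly like this; the model and tensor names here are assumptions, not the real config:

```python
import numpy as np
import triton_python_backend_utils as pb_utils


def stream_from_trtllm(input_ids: np.ndarray):
    # BLS call into the decoupled tensorrt-llm model from inside the
    # Python model, without opening a gRPC stream inside the server.
    infer_request = pb_utils.InferenceRequest(
        model_name="tensorrt_llm",               # assumed model name
        requested_output_names=["output_ids"],   # assumed output name
        inputs=[pb_utils.Tensor("input_ids", input_ids)],
    )
    # exec(decoupled=True) returns an iterator over the streamed responses.
    for response in infer_request.exec(decoupled=True):
        if response.has_error():
            raise pb_utils.TritonModelException(response.error().message())
        yield pb_utils.get_output_tensor_by_name(response, "output_ids").as_numpy()
```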

Triton Information: 25.08 container, Triton 2.60.0.

Are you using the Triton container or did you build it yourself? The Triton TensorRT-LLM container 25.08 with a custom ONNX backend.

To Reproduce: I am working on making the whisper BLS example work with batching, but that is just a side note; a minimal client-side reproduction is sketched below.
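A minimal reproduction with the synchronous gRPC client would look roughly like this (model name, tensor names and shapes are placeholders):

```python
import numpy as np
import tritonclient.grpc as grpcclient


def on_result(result, error):
    # Callback for streamed responses (decoupled models require the stream API).
    print(error if error else result.as_numpy("TEXT"))


client = grpcclient.InferenceServerClient("localhost:8001")
client.start_stream(callback=on_result)

# Send 4 "files" over the same stream; each becomes its own request and,
# as described above, they are never batched together on the server.
for _ in range(4):
    audio = np.random.rand(1, 16000).astype(np.float32)  # dummy audio, batch dim 1
    inp = grpcclient.InferInput("AUDIO", audio.shape, "FP32")
    inp.set_data_from_numpy(audio)
    client.async_stream_infer(model_name="whisper_bls", inputs=[inp])

client.stop_stream()
```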

Expected behavior: requests sent over the stream to a decoupled Python model with dynamic batching enabled should be grouped into batches in execute(), instead of each arriving on its own with batch_size 1.

protonicage · Sep 24 '25 21:09