
Dynamic batching that supports static batch size with padding

Open ShuaiShao93 opened this issue 1 year ago • 10 comments

Is your feature request related to a problem? Please describe. Since TensorRT has limited support for dynamic shapes, the dynamic batch size required by the dynamic batcher is not ideal.

Describe the solution you'd like Support padding the batch up to the static batch size when there is not enough data to fill a full batch.
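
For illustration, a minimal sketch of what this padding could look like on a NumPy batch (the function name and signature are illustrative, not part of Triton's API):

```python
import numpy as np

def pad_batch(batch: np.ndarray, static_batch_size: int) -> tuple[np.ndarray, int]:
    """Pad a partial batch with zero rows up to the engine's static batch size.

    Returns the padded batch and the real batch size, so the padded rows can
    be dropped from the engine output afterwards.
    """
    real_size = batch.shape[0]
    if real_size >= static_batch_size:
        return batch, real_size
    padding = np.zeros((static_batch_size - real_size, *batch.shape[1:]),
                       dtype=batch.dtype)
    return np.concatenate([batch, padding], axis=0), real_size
```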

ShuaiShao93, Apr 17 '24 01:04

Great minds think alike. I'm trying to manually implement padding on the request side.

SunnyGhj, Apr 17 '24 07:04

Great minds think alike. I'm trying to manually implement padding on the request side.

Does this mean you disabled dynamic batching on Triton? That is not ideal, because one of the most important reasons for us to use Triton is dynamic batching.

ShuaiShao93, Apr 17 '24 15:04

when there is not enough data to fill a full batch.

Similarly, we have manually implemented batching of requests on the client and fixed the batch size to the static batch size. We pad the data when there is not enough to fill a full batch.

SunnyGhj, Apr 17 '24 17:04

when there is not enough data to fill a full batch.

Similarly, we have manually implemented batching of requests on the client and fixed the batch size to the static batch size. We pad the data when there is not enough to fill a full batch.

OK, it sounds like you re-implemented the dynamic batcher in your own client, which is probably not the best investment of time. I hope Triton can support this natively, but thanks for sharing!

ShuaiShao93, Apr 17 '24 17:04

I think this enhancement makes sense. @GuanLuo / @nnshah1 any additional thoughts?

Tabrizian, Apr 19 '24 18:04

@ShuaiShao93 If I understand correctly, the idea here is to have a static batch size defined in the engine, but then have the dynamic batcher pad whenever it sends in a smaller batch?

Is that something to handle in the server or in the backend? It might be more efficient to pad right before sending it to the engine.

nnshah1, Apr 19 '24 20:04

@nnshah1 How is this possible?

Let's say a model has a static batch size of 8. There are two clients: client A has a request of batch size 4, and client B has a request of batch size 3.

Ideally, if A and B call the Triton server at the same time, the dynamic batcher makes a batch of size 7 and then pads it to 8.

But if we pad at the client, meaning A pads 4 to 8 and B pads 3 to 8, we need to run inference twice, which doubles the cost.
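
A quick back-of-the-envelope sketch of that cost difference, using the numbers above:

```python
import math

max_batch = 8        # engine's static batch size
requests = [4, 3]    # batch sizes from clients A and B

# Server-side padding: the dynamic batcher combines 4 + 3 = 7, pads to 8.
server_runs = math.ceil(sum(requests) / max_batch)   # -> 1 engine run

# Client-side padding: each client pads its own request to 8.
client_runs = len(requests)                          # -> 2 engine runs
```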

ShuaiShao93, Apr 19 '24 21:04

@nnshah1 How is this possible?

Let's say a model has a static batch size of 8. There are two clients: client A has a request of batch size 4, and client B has a request of batch size 3.

Ideally, if A and B call the Triton server at the same time, the dynamic batcher makes a batch of size 7 and then pads it to 8.

But if we pad at the client, meaning A pads 4 to 8 and B pads 3 to 8, we need to run inference twice, which doubles the cost.

No, I get your point. I mean to pad in the TRT backend vs. the core server piece, not to pad at the client.

nnshah1, Apr 19 '24 21:04

As a kind of example, for our stable diffusion tutorial I ended up padding / splitting on the model side and allowing the dynamic batcher to provide batches independently of that. (This is just an example and would need to be implemented in the TRT engine or Triton core.)

https://github.com/triton-inference-server/tutorials/blob/cb2ca257000cd14d59642a7aa86b56d054535d73/Popular_Models_Guide/StableDiffusion/backend/diffusion/model.py#L178
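
Roughly, the idea looks something like this sketch of a Python backend execute() (this is not the tutorial code; STATIC_BATCH, the tensor names, and _run_engine are placeholders):

```python
import numpy as np
import triton_python_backend_utils as pb_utils

STATIC_BATCH = 8  # the engine's fixed batch size (placeholder value)

class TritonPythonModel:
    def execute(self, requests):
        # Gather the per-request batches that the dynamic batcher delivered.
        inputs = [pb_utils.get_input_tensor_by_name(r, "INPUT").as_numpy()
                  for r in requests]
        sizes = [x.shape[0] for x in inputs]
        batch = np.concatenate(inputs, axis=0)

        # Pad the combined batch up to a multiple of the static batch size
        # right before handing it to the engine.
        remainder = batch.shape[0] % STATIC_BATCH
        if remainder:
            pad = np.zeros((STATIC_BATCH - remainder, *batch.shape[1:]),
                           dtype=batch.dtype)
            batch = np.concatenate([batch, pad], axis=0)

        outputs = self._run_engine(batch)  # placeholder for the TRT call

        # Split the engine output back into one response per original
        # request, dropping the padded rows.
        responses, offset = [], 0
        for size in sizes:
            out = pb_utils.Tensor("OUTPUT", outputs[offset:offset + size])
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
            offset += size
        return responses
```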

nnshah1, Apr 19 '24 21:04

@nnshah1 Ah, gotcha. Thanks! Either should work, but it sounds better to make this a general feature and expose it as a flag in the config, in case other backends also want a static batch size.
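
Something along these lines in the model config, purely for illustration: pad_to_max_batch_size is not an existing Triton option, just a sketch of what the requested flag might look like (max_batch_size and dynamic_batching are real fields).

```
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 8 ]
  max_queue_delay_microseconds: 100
  pad_to_max_batch_size: true  # hypothetical flag: pad partial batches to 8
}
```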

ShuaiShao93, Apr 19 '24 21:04