Dynamic batching for large requests
Hi Team,
I'm encountering an issue with dynamic batching performance in a Triton Inference Server ensemble for a Vision-Language Model (VLM).
My Setup:
Ensemble Structure:
- preprocessing step: Python backend
- vision-encoder step: Python backend
- postprocessing step: Python backend
- LLM step: trtllm-backend

max_batch_size: Set to 8 for all steps in the ensemble.
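For reference, the batching-related part of each step's config.pbtxt looks roughly like the sketch below. The model name, tensor names, dtypes, and shapes here are placeholders rather than my exact config; the relevant pieces are max_batch_size and the dynamic_batching block, which is enabled with default settings.

```
# Sketch of one ensemble step's config.pbtxt (names, dtypes, and shapes are placeholders)
name: "vision_encoder"
backend: "python"
max_batch_size: 8

input [
  {
    name: "IMAGE"
    data_type: TYPE_UINT8
    dims: [ -1, -1, 3 ]   # variable-sized image input (~2MP in my tests)
  }
]
output [
  {
    name: "IMAGE_EMBEDS"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
  }
]

# Dynamic batching enabled with default settings (no preferred sizes, no queue delay)
dynamic_batching { }
```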
Observed Behavior:
Small Requests: When sending relatively small requests (e.g., smaller images or shorter prompts), dynamic batching appears to work as expected. I observe Inference_count / execution_count values close to max_batch_size (8), indicating efficient batch formation.
Large Requests (The Issue): When I send larger requests, the dynamic batcher's effectiveness significantly degrades. In my specific case, each request includes:
- A 2MP image, which is encoded by the vision-encoder step.
- A text prompt of approximately 700-800 tokens.
For these large requests, I can barely achieve batch sizes of 2 (Inference_count / execution_count is around 2), despite the max_batch_size being 8.
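By those numbers I mean the inference_count / execution_count counters from the per-model statistics endpoint; I check them per step roughly like this (default HTTP port, model name is a placeholder):

```
# inference_count / execution_count gives the average batch size actually formed
curl -s localhost:8000/v2/models/vision_encoder/stats
```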
Profiling:
I am using perf_analyzer to profile the ensemble, sending 10 concurrent requests. During these large request tests, GPU utilization remains at 100%.
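The invocation is roughly the following (the model name and input-data file are placeholders for my actual ones):

```
# 10 concurrent requests against the ensemble, using pre-generated large inputs
perf_analyzer -m vlm_ensemble \
  --concurrency-range 10 \
  --input-data large_requests.json \
  --measurement-interval 10000
```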
Questions:
- Is this behavior normal for Triton's dynamic batcher when requests are large and potentially consume more GPU memory or computational resources per item?
- I suspect the default batching strategy used by the dynamic batcher is volume_batching, which can be found here. Can you please confirm that? The implementation of volume_batching aligns very closely with the behaviour I'm seeing from the dynamic batcher. If that is the case, how can I override the volume size so the batcher can fit more requests before running out of volume? (A rough config sketch of what I have in mind is below.)
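For the last question, this is roughly what I have in mind for the encoder/LLM step configs. The preferred_batch_size and max_queue_delay_microseconds fields are standard dynamic batcher tuning; the parameter key for the volume limit is only my guess at what a volume-based strategy like the volume_batching example would read from the model config, so please correct it if it's wrong:

```
dynamic_batching {
  # Encourage the scheduler to wait briefly and prefer fuller batches
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 5000
}

# Guessed parameter for a volume-based custom batching strategy.
# I have NOT verified this key; it is only what I would expect the
# volume_batching example to read from the model config.
parameters: {
  key: "MAX_BATCH_VOLUME_BYTES"
  value: { string_value: "67108864" }
}
```

If the limit is not configurable this way, guidance on where the volume cap actually comes from (and how to raise it) would be much appreciated.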