
Dynamic batching for large requests

Open nzarif opened this issue 6 months ago • 0 comments

Hi Team,

I'm encountering an issue with dynamic batching performance in a Triton Inference Server ensemble for a Vision-Language Model (VLM).

My Setup:

  • Ensemble structure:
      • preprocessing step: Python backend
      • vision-encoder step: Python backend
      • postprocessing step: Python backend
      • LLM step: trtllm-backend
  • max_batch_size: set to 8 for all steps in the ensemble.
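
For context, each step enables dynamic batching along these lines. This is only an illustrative sketch: max_batch_size: 8 matches the setup above, but the preferred_batch_size and queue-delay values below are placeholders rather than my exact config:

```
# Illustrative config.pbtxt excerpt for one ensemble step.
# max_batch_size matches the setup described above; the
# dynamic_batching fields are standard Triton options with
# placeholder values.
max_batch_size: 8

dynamic_batching {
  # Batch sizes the scheduler prefers to form when possible.
  preferred_batch_size: [ 4, 8 ]
  # How long a request may wait in the queue for more requests
  # to arrive before the batch is dispatched anyway.
  max_queue_delay_microseconds: 100
}
```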

Observed Behavior:

Small Requests: When sending relatively small requests (e.g., smaller images or shorter prompts), dynamic batching appears to work as expected. I observe inference_count / execution_count values close to max_batch_size (8), indicating efficient batch formation.

Large Requests (The Issue): When I send larger requests, the dynamic batcher's effectiveness significantly degrades. In my specific case, each request includes:

  • A 2MP image, which is encoded by the vision-encoder step.
  • A text prompt of approximately 700-800 tokens.

For these large requests, I can barely achieve batch sizes of 2 (inference_count / execution_count is around 2), despite max_batch_size being 8.

Profiling: I am using perf_analyzer to profile the ensemble, sending 10 concurrent requests. During these large-request tests, GPU utilization remains at 100%.
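
For reference, this is roughly how I'm reading those counters, via Triton's statistics API (a minimal sketch using tritonclient's HTTP client; the model name and URL are placeholders for my actual setup):

```python
# Minimal sketch: query Triton's model statistics and compute the
# average batch size as inference_count / execution_count.
# "ensemble_model" and the URL are placeholders.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

stats = client.get_inference_statistics(model_name="ensemble_model")
for model_stats in stats["model_stats"]:
    inference_count = model_stats["inference_count"]
    execution_count = model_stats["execution_count"]
    if execution_count:
        avg_batch = inference_count / execution_count
        print(f"{model_stats['name']} (v{model_stats['version']}): "
              f"avg batch size ~ {avg_batch:.2f}")
```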

Questions:

  • Is this behavior normal for Triton's dynamic batcher when requests are large and potentially consume more GPU memory or computational resources per item?
  • I suspect the default batching strategy used by the dynamic batcher is volume_batching, which can be found here. Can you please confirm that? The implementation of volume_batching aligns very well with the behavior I'm seeing from the dynamic batcher. If that is the case, how can I override the volume size so the batcher can fit more requests before running out of volume?

nzarif · Jun 04 '25 21:06