
Batching documentation confusing - can you update the docs of the main repository please

Open protonicage opened this issue 2 months ago • 0 comments

System Info

Not strictly necessary for this issue, but I am using the 25.09 NGC TensorRT-LLM container for Triton Inference Server.

Who can help?

@juney-nvidia @kaiyux

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

At the moment a lot of the tutorials/docs, and especially the links, are deprecated, broken, or no longer valid, which makes things very confusing. An example:

If you look at this section, the links are broken: https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#scheduling

This one, for example, leads nowhere, and searching for the file does not turn it up either: https://github.com/NVIDIA/TensorRT-LLM/tree/v0.15.0/docs/source/advanced/batch-manager.md

There is also some ambiguity about how the batching method in TensorRT-LLM can be enabled/changed.

So you basically have these options:

  1. static batching
  2. dynamic batching
  3. inflight_batching
  4. inflight_fused_batching

But in order to set these (from my limited understanding):

  1. As of now you set this in config.pbtxt:

```
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "STATIC_BATCH"
  }
}
```

  2. You just set it as usual in the config.pbtxt as well:

```
dynamic_batching {
  preferred_batch_size: [ 32, 16, 8, 4 ]
  max_queue_delay_microseconds: 500
  default_queue_policy: { max_queue_size: 256 }
}
```

  3. and 4. are set via the same parameter (just with a different string_value):

```
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_batching"
  }
}
```

What confuses the heck out of me: I can define all of these in any combination in the same config and it still works. However, I am then completely clueless about how the batching works internally. Static? Dynamic? Static when I send a whole batch to the model, and dynamic plus some kind of fusing when it is one request at a time? I can also send batches when STATIC_BATCH is not enabled, just with inflight_fused_batching. And what does inflight_fused_batching + dynamic_batching do?
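
For concreteness, here is the kind of combined config.pbtxt fragment I mean, simply the three snippets from above merged into one file (the values are the ones shown earlier; whether these settings interact or override each other is exactly what I cannot tell from the docs):

```
# Combined fragment: Triton dynamic batching plus both TensorRT-LLM backend
# parameters in the same config.pbtxt. The model loads without complaint,
# but which batching behaviour is actually in effect is unclear to me.
dynamic_batching {
  preferred_batch_size: [ 32, 16, 8, 4 ]
  max_queue_delay_microseconds: 500
  default_queue_policy: { max_queue_size: 256 }
}
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "STATIC_BATCH"
  }
}
```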

How can I set up a config like this so that exclusively static batching, dynamic batching, or in-flight batching is enabled?

Expected behavior

Maybe a clearer explanation of the settings and how they work in combination, and links that actually work.

Gemini gave me this explanation, which seems plausible:

> Capacity Scheduler Policy (CapacitySchedulerPolicy): This is the parameter you were looking at (kMAX_UTILIZATION, kGUARANTEED_NO_EVICT, kSTATIC_BATCH). This controls the scheduling logic: how the backend manages the shared KV cache memory and decides which requests to run in a batch at any given moment.
>
> Batching Strategy (In-Flight Batching / Continuous Batching): This is the fundamental, high-level technique used by the scheduler. For high-performance LLM serving, TensorRT-LLM uses In-Flight Batching (also called Continuous Batching) by default, which is an architectural feature and is not typically a separate, switchable strategy parameter.

Would this mean that dynamic_batching is simply meant to be enabled alongside inflight_batching?
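
If that two-knob reading is correct, I would expect the config to look roughly like the sketch below. This is my assumption, not something from the docs: the parameter keys are the ones already used above, and the lowercase "guaranteed_no_evict" / "max_utilization" spellings are my guess at how the C++ policy names map to config.pbtxt values. Whether Triton's dynamic_batching block then sits on top of the in-flight batcher or is simply ignored is exactly what I would like the docs to spell out.

```
# Sketch of the "two independent knobs" interpretation (assumption, not verified).
# Knob 1: the batching strategy used by the TensorRT-LLM backend.
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"  # in-flight / continuous batching
  }
}
# Knob 2: the capacity scheduler policy deciding which requests fit into a step.
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "guaranteed_no_evict"  # or "max_utilization"
  }
}
```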

actual behavior

The current docs.

additional notes

I use TensorRT-LLM through Triton, so I cannot configure a scheduler from Python and have to do everything via config.pbtxt.

protonicage · Oct 10 '25, 11:10