[Question] Modifying the Batch Scheduling Policy in the trtllm-bench CLI
I'm using TensorRT-LLM v0.17.0.post1 inside the Docker container image nvcr.io/nvidia/tritonserver:25.02-trtllm-python-py3.
I'm trying to build and benchmark a TRT-LLM engine using the trtllm-bench CLI.
Here is the command I'm using:
trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct throughput --dataset /path/to/synthetic/dataset --engine_dir /path/to/engine/dir
When I run this command, the scheduling policy defaults to Guaranteed_no_evict.
I checked the available CLI options using --help, but it seems there is no option to specify a custom scheduling policy directly. The relevant options displayed are:
Options:
Engine run configuration.:
--engine_dir PATH
--backend [pytorch]
--extra_llm_api_options TEXT
--max_batch_size INTEGER
--max_num_tokens INTEGER
--max_seq_len INTEGER
--beam_width INTEGER
--kv_cache_free_gpu_mem_fraction FLOAT
Engine Input Configuration:
--dataset PATH
--num_requests INTEGER
--warmup INTEGER
--tp INTEGER
--pp INTEGER
--target_input_len INTEGER RANGE
--target_output_len INTEGER RANGE
Request Load Control Options: [mutually_exclusive]
--concurrency INTEGER
--streaming
--help
Since there was no option for setting the scheduling policy from the CLI, I looked into the source code. In:
/usr/local/lib/python3.12/dist-packages/tensorrt_llm/bench/benchmark/throughput.py
I found the following code around lines 246–251:
# Update configuration with runtime options
exec_settings["settings_config"]["kv_cache_percent"] = kv_cache_percent
exec_settings["settings_config"]["max_batch_size"] = runtime_max_bs
exec_settings["settings_config"]["max_num_tokens"] = runtime_max_tokens
exec_settings["settings_config"]["beam_width"] = beam_width
exec_settings["settings_config"]["scheduler_policy"] = IFBSchedulingPolicy.NO_EVICT
Then, in:
/usr/local/lib/python3.12/dist-packages/tensorrt_llm/bench/dataclasses/enums.py
I found the IFBSchedulingPolicy class, which is defined as follows:
class IFBSchedulingPolicy(MultiValueEnum):
    # Note: 'MAX_UTILIZTION' is misspelled in the source itself; it's not my typo.
    MAX_UTILIZTION = CapacitySchedulerPolicy.MAX_UTILIZATION, MAX_UTIL, "max_utilization"
    NO_EVICT = CapacitySchedulerPolicy.GUARANTEED_NO_EVICT, NO_EVICT, "guaranteed_no_evict"
    STATIC = "Static", "static"
So I modified throughput.py like this:
exec_settings["settings_config"]["scheduler_policy"] = IFBSchedulingPolicy.MAX_UTILIZTION
After making this change and re-running the benchmark with the same command, I noticed that the output shows the scheduling policy as Max_Utilization, which suggests the change was applied.
However, I’m wondering: even though the output reflects the change, is the Max_Utilization scheduling policy actually being applied to the model inference? Or is it only a cosmetic change in the logging output?
Thanks in advance!
@FrankD412 @kaiyux @jiahanc
Hi Frank/Kaiyu/Cyrus, could you help confirm this question from the community?
Thanks,
June
Hi @byStander9,
Thanks for the question.
If you change the scheduling policy by setting exec_settings["settings_config"]["scheduler_policy"], it does change the scheduling policy used during inference, because this value is an initialization parameter for the LLM API.
For more details: the log line [TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION is emitted by the C++ batch manager.
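To make that concrete, here is a rough sketch of what the enum's canonical value ultimately feeds into. The import path and constructor keyword below (tensorrt_llm.bindings.executor, SchedulerConfig, capacity_scheduler_policy) are assumptions based on the executor bindings and may differ between versions; the actual plumbing inside trtllm-bench is more involved.

# Rough sketch: the benchmark's scheduler policy ultimately becomes a SchedulerConfig
# handed to the executor, and the C++ batch manager logs the policy it received.
# Names below are assumptions; check the bindings of your installed version.
from tensorrt_llm.bindings.executor import CapacitySchedulerPolicy, SchedulerConfig

scheduler_config = SchedulerConfig(
    capacity_scheduler_policy=CapacitySchedulerPolicy.MAX_UTILIZATION)
print(scheduler_config.capacity_scheduler_policy)  # CapacitySchedulerPolicy.MAX_UTILIZATION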
Thanks for the help! @jiahanc
I have one more question, if you don't mind. Do you know which file contains the variable where the latency of individual requests within a batch is stored?
The benchmark output only shows the total latency, but I'd like to access the per-request latency values if possible.
Hi @byStander9, the per-request latencies are recorded in request_latencies.
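In case it helps, here is a small sketch of summarizing such per-request latencies once you have pulled them out. The list literal below is made up for illustration; the real values come from the benchmark's recorded request_latencies.

# Hypothetical post-processing of per-request latencies (in seconds).
# The list below is illustrative only; real values come from request_latencies.
import statistics

request_latencies = [0.84, 0.91, 1.02, 0.88, 1.15]

print(f"mean   : {statistics.mean(request_latencies):.3f} s")
print(f"median : {statistics.median(request_latencies):.3f} s")
print(f"max    : {max(request_latencies):.3f} s")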
Really appreciate the help. Thanks a lot again!