
[Question] Modifying the Batch Scheduling Policy in the trtllm-bench CLI

Open · byStander9 opened this issue 9 months ago

I'm using tensorrt-llm v0.17.0.post1 inside the Docker container image nvcr.io/nvidia/tritonserver:25.02-trtllm-python-py3, and I'm trying to build and benchmark a TRT-LLM engine with the trtllm-bench CLI.

Here is the command I'm using:

trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct throughput --dataset /path/to/synthetic/dataset --engine_dir /path/to/engine/dir

When I run this command, the scheduling policy defaults to Guaranteed_no_evict.

I checked the available CLI options using --help, but it seems there is no option to specify a custom scheduling policy directly. The relevant options displayed are:

Options:
  Engine run configuration.:
    --engine_dir PATH             
    --backend [pytorch]           
    --extra_llm_api_options TEXT  
    --max_batch_size INTEGER      
    --max_num_tokens INTEGER      
    --max_seq_len INTEGER         
    --beam_width INTEGER          
    --kv_cache_free_gpu_mem_fraction FLOAT

  Engine Input Configuration:
    --dataset PATH                
    --num_requests INTEGER        
    --warmup INTEGER              
    --tp INTEGER                  
    --pp INTEGER                  
    --target_input_len INTEGER RANGE
    --target_output_len INTEGER RANGE

  Request Load Control Options: [mutually_exclusive]
    --concurrency INTEGER         
    
  --streaming                     
  --help                          

Since there was no option for setting the scheduling policy from the CLI, I looked into the source code. In:

/usr/local/lib/python3.12/dist-packages/tensorrt_llm/bench/benchmark/throughput.py

I found the following code around lines 246–251:

# Update configuration with runtime options
exec_settings["settings_config"]["kv_cache_percent"] = kv_cache_percent
exec_settings["settings_config"]["max_batch_size"] = runtime_max_bs
exec_settings["settings_config"]["max_num_tokens"] = runtime_max_tokens
exec_settings["settings_config"]["beam_width"] = beam_width
exec_settings["settings_config"]["scheduler_policy"] = IFBSchedulingPolicy.NO_EVICT

Then, in:

/usr/local/lib/python3.12/dist-packages/tensorrt_llm/bench/dataclasses/enums.py

I found the IFBSchedulingPolicy class, which looks like this:

class IFBSchedulingPolicy(MultiValueEnum):
    # Note: the missing 'A' in 'UTILIZTION' is how the member is actually spelled in the source, not a transcription error here.
    MAX_UTILIZTION = CapacitySchedulerPolicy.MAX_UTILIZATION, MAX_UTIL, "max_utilization"
    NO_EVICT = CapacitySchedulerPolicy.GUARANTEED_NO_EVICT, NO_EVICT, "guaranteed_no_evict"
    STATIC = "Static", "static"
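If I understand the multi-value enum correctly, each member can be looked up by any of its listed values, and .value returns the first one, i.e. the corresponding CapacitySchedulerPolicy member. A quick illustration (my own sketch, assuming aenum-style MultiValueEnum semantics, not code from the source):

# Illustration only, assuming aenum-style MultiValueEnum lookup semantics.
from tensorrt_llm.bench.dataclasses.enums import IFBSchedulingPolicy

policy = IFBSchedulingPolicy("max_utilization")  # resolves to MAX_UTILIZTION
print(policy.value)  # first listed value, i.e. CapacitySchedulerPolicy.MAX_UTILIZATION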

So I modified throughput.py like this:

exec_settings["settings_config"]["scheduler_policy"] = IFBSchedulingPolicy.MAX_UTILIZTION

After making this change and re-running the benchmark with the same command, I noticed that the output shows the scheduling policy as Max_Utilization, which suggests the change was applied.

However, I’m wondering: even though the output reflects the change, is the Max_Utilization scheduling policy actually being applied to the model inference? Or is it only a cosmetic change in the logging output?

Thanks in advance!

byStander9 · Mar 27 '25

@FrankD412 @kaiyux @jiahanc

Hi Frank/Kaiyu/Cyrus, could you help confirm the behavior this community member is asking about?

Thanks, June

juney-nvidia · Mar 27 '25

Hi @byStander9, thanks for the question. If you change the scheduling policy by setting exec_settings["settings_config"]["scheduler_policy"], it will change the scheduling policy used during inference, because this value is an initialization parameter for the LLM API. You can also confirm it from the log line [TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION, which is emitted by the C++ batch manager.
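Conceptually, that setting ends up in the executor's scheduler configuration when the LLM is constructed, roughly like the simplified sketch below (class paths taken from the executor bindings; the benchmark's actual plumbing differs, and argument names may vary across versions):

# Simplified sketch, not the benchmark's own code: the chosen policy
# ultimately becomes part of the executor's SchedulerConfig.
from tensorrt_llm.bindings.executor import CapacitySchedulerPolicy, SchedulerConfig

scheduler_config = SchedulerConfig(
    capacity_scheduler_policy=CapacitySchedulerPolicy.MAX_UTILIZATION)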

jiahanc · Mar 27 '25

Thanks for the help! @jiahanc

I have one more question, if you don't mind. Do you know which file contains the variable where the latency of individual requests within a batch is stored?

The benchmark output only shows the total latency, but I'd like to access the per-request latency values if possible.

byStander9 · Mar 28 '25

Hi @byStander9, the per-request latencies are recorded in request_latencies.
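If you want to export them, something along these lines should work once you locate that variable (illustrative only; the variable name and structure are assumptions here, so adapt it to what the statistics code actually stores):

# Illustrative only: dump per-request latencies to a CSV file.
# `request_latencies` is assumed to be an iterable of per-request latency
# values in seconds; adjust to the actual data structure.
import csv

def dump_request_latencies(request_latencies, path="request_latencies.csv"):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["request_index", "latency_s"])
        for idx, latency in enumerate(request_latencies):
            writer.writerow([idx, latency])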

jiahanc · Mar 28 '25

Really appreciate the help. Thanks a lot again!

byStander9 · Mar 28 '25