
Executor API: How to get throughput


I am looking at the benchmarks/python/kv_cache_offload/benchmark.py example and trying to figure out how I can get throughput. This example gets TTFT from executor.get_latest_iteration_stats(). I looked at the get_latest_iteration_stats() definition but didn't find any information on how to get throughput. Here is what is available via the get_latest_iteration_stats() function: 'cpu_mem_usage', 'cross_kv_cache_stats', 'gpu_mem_usage', 'inflight_batching_stats', 'iter', 'iter_latency_ms', 'kv_cache_stats', 'max_num_active_requests', 'new_active_requests_queue_latency_ms', 'num_active_requests', 'num_completed_requests', 'num_new_active_requests', 'num_queued_requests', 'pinned_mem_usage', 'static_batching_stats', 'timestamp', 'to_json_str'

khayamgondal avatar Mar 28 '25 16:03 khayamgondal

Hi @khayamgondal

The throughput information should be stored in inflight_batching_stats.

Also, we are moving to trtllm-bench to consolidate the performance benchmarking process, and I would suggest referring to it.

For any specific question about this kv_cache_offload benchmark script, @SimengLiu-nv may help.

For any question about trtllm-bench, @FrankD412 @kaiyux @jiahanc can help.

Thanks June

juney-nvidia avatar Mar 29 '25 07:03 juney-nvidia

Hi @khayamgondal, the end-to-end throughput statistics are calculated, not directly reported. For example: Token throughput (tokens/sec) = total_output_tokens / total_latency, and Request throughput (req/sec) = total_num_requests / total_latency. For total_latency, you can convert the logged E2E TIME (ms) to seconds.
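
For a concrete illustration, here is a minimal Python sketch of that arithmetic; the token/request totals and the E2E TIME value are placeholders you would take from your own benchmark output:

# Placeholder numbers; substitute the values from your own run.
e2e_time_ms = 125_000.0          # logged "E2E TIME (ms)"
total_output_tokens = 400_000    # generated tokens summed over all requests
total_num_requests = 512         # requests in the benchmark

total_latency_s = e2e_time_ms / 1000.0
token_throughput = total_output_tokens / total_latency_s    # tokens/sec
request_throughput = total_num_requests / total_latency_s   # req/sec
print(f"{token_throughput:.1f} tokens/sec, {request_throughput:.2f} req/sec")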

SimengLiu-nv avatar Mar 31 '25 03:03 SimengLiu-nv

Thanks @juney-nvidia, how do I use trtllm.KvCacheConfig with trtllm-bench?

khayamgondal avatar Mar 31 '25 18:03 khayamgondal

Thanks @juney-nvidia, how do I use trtllm.KvCacheConfig with trtllm-bench?

@khayamgondal There is a --extra_llm_api_options argument provided by trtllm-bench that allows you to specify any custom configuration following the LlmArgs data structure, and that includes kv_cache_config.

Let me know for further questions, thanks.

kaiyux avatar Apr 01 '25 03:04 kaiyux

@khayamgondal -- you should be able to specify a KV cache config via an extra options YML file. It would look something like:

kv_cache_config:
  free_gpu_memory_fraction: 0.95
  enable_block_reuse: true

You can find the specific arguments to the KvCacheConfig [here](https://github.com/NVIDIA/TensorRT-LLM/blob/8bb3eea285db15c3b54c66230eb2701505fc863f/tensorrt_llm/llmapi/llm_args.py#L520). As @kaiyux mentioned, you can pass this in via --extra_llm_api_options (and it isn't limited to just the KV cache). A caveat: if you modify an entry, none of the configuration set by trtllm-bench will hold, so make sure you set everything you need (even values trtllm-bench would normally set).

FrankD412 avatar Apr 01 '25 04:04 FrankD412

Thanks @FrankD412, is there a way to dump the default values set by trtllm-bench?

Based on llm_args.py, it looks like I need the following:

kv_cache_config:
    enable_block_reuse: true
    max_tokens: 17800
    max_attention_window: null
    sink_token_length: null
    free_gpu_memory_fraction: 0.9
    host_cache_size: null
    onboard_blocks: true
    cross_kv_cache_fraction: null
    secondary_offload_min_priority: null
    event_buffer_max_size: 0
    enable_partial_reuse: true
    copy_on_partial_reuse: true

khayamgondal avatar Apr 01 '25 20:04 khayamgondal

@khayamgondal -- at the moment there isn't. There is some ongoing effort to make the configuration classes that the LLM API uses pure Python so that we can use Pydantic to serialize/deserialize a configuration YAML, but that's still a WIP.

For the KV configuration, I think the only thing you'll really need to worry about is free_gpu_memory_fraction; any kwarg you don't set just takes its default.

FrankD412 avatar Apr 02 '25 15:04 FrankD412

Thanks @FrankD412, my goal is to play with free_gpu_memory_fraction and host_cache_size to adjust how much of the KV cache resides on GPU vs. CPU. I see I can make a change in TensorRT-LLM/tensorrt_llm/bench/benchmark/throughput.py to adjust free_gpu_memory_fraction, but I don't see any option for host_cache_size in that file. If I could set these two values in the throughput.py script, that would be easier for me than creating a YAML config file.

@optgroup.option(
    "--kv_cache_free_gpu_mem_fraction",
    type=float,
    default=.90,
    help="The percentage of memory to use for KV Cache after model load.",
)

khayamgondal avatar Apr 02 '25 15:04 khayamgondal

Got it -- thanks for the feedback @khayamgondal

With the new configuration classes, there may be a way we could add a general CLI option to set properties that don't have a specific CLI option; I'll keep that in mind as we move towards it and see if there's a way to achieve it. We decided to limit the options on the CLI to make the benchmark easier to use and only expose what we thought were the biggest knobs a user would want to tune. For now, the YAML file is the way to handle it.

FrankD412 avatar Apr 02 '25 15:04 FrankD412

@khayamgondal You can try adding the host_cache_size option in https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/bench/dataclasses/configuration.py#L215-L220.
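
If you'd rather not modify the source: since host_cache_size is a KvCacheConfig field, it should also be settable through the --extra_llm_api_options YAML; a hedged sketch with placeholder values:

kv_cache_config:
  free_gpu_memory_fraction: 0.5        # placeholder; shrink the on-GPU KV pool
  host_cache_size: 32000000000         # placeholder; host memory (bytes) for offloaded KV blocks

which you would then pass to trtllm-bench via --extra_llm_api_options.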

SimengLiu-nv avatar Apr 02 '25 15:04 SimengLiu-nv

Ok @FrankD412, so with the YAML approach I have to provide all of the following configs, right, even if I just want to change a few values in KvCacheConfig? (from TensorRT-LLM/tensorrt_llm/llmapi/llm_args.py)

    field_mapping = {
        "quant_config": QuantConfig,
        "calib_config": CalibConfig,
        "build_config": BuildConfig,
        "kv_cache_config": KvCacheConfig,
        "decoding_config": DecodingConfig,
        "enable_build_cache": BuildCacheConfig,
        "peft_cache_config": PeftCacheConfig,
        "scheduler_config": SchedulerConfig,
        "speculative_config": DecodingBaseConfig,
        "batching_type": BatchingType,
        "extended_runtime_perf_knob_config": ExtendedRuntimePerfKnobConfig,
        "pytorch_backend_config": PyTorchConfig,
    }

khayamgondal avatar Apr 02 '25 15:04 khayamgondal

@khayamgondal -- you don't need to provide all of those; that's just a mapping so the function knows which class to use to initialize each specific kwarg.
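
To make that concrete, here is a rough sketch of what the mapping implies (not the actual trtllm-bench code): only the top-level keys present in your YAML get instantiated, so a file containing just kv_cache_config touches nothing else. The file name is hypothetical and the import path for KvCacheConfig is assumed from its location in llm_args.py.

import yaml
from tensorrt_llm.llmapi import KvCacheConfig

with open("extra_opts.yaml") as f:       # hypothetical extra options file
    overrides = yaml.safe_load(f)

# Only keys present in the file are built; here that's just kv_cache_config.
if "kv_cache_config" in overrides:
    kv_cfg = KvCacheConfig(**overrides["kv_cache_config"])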

Edit:

@khayamgondal You can try adding the host_cache_size option in https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/bench/dataclasses/configuration.py#L215-L220.

Just saw this. If you're working with the codebase and know what you want to set it to, then this can work.

FrankD412 avatar Apr 02 '25 15:04 FrankD412

@SimengLiu-nv quick architectural question: if I specify a lower value for free_gpu_memory_fraction than what is required for the KV cache, does that force the KV cache to be offloaded to CPU, or does it cause the KV cache to be recomputed on the fly?

khayamgondal avatar Apr 02 '25 15:04 khayamgondal

@khayamgondal You can think of on-GPU kv_cache memory as serving two main purposes:

  1. Per-iteration allocation: At the start of each iteration, enough GPU memory must be available to hold the kv_cache that will be generated during that step. If the free_gpu_memory_fraction is too low to meet this requirement, execution will fail. This effectively sets a hard lower bound.
  2. Accumulated cache management: Beyond the memory needed for the current iteration, the remaining GPU memory is used to store accumulated (i.e., past) kv_cache. If this accumulated cache exceeds the available space, the least recently used entries are offloaded to CPU memory—up to the size limit set by host_cache_size.

So, to directly answer your question:

If free_gpu_memory_fraction is too low to support a single iteration’s kv_cache, the run will fail.

If there's enough space for a single iteration, then offloading to CPU can occur when the total on-GPU kv_cache usage exceeds the available memory. When a new input requires previously generated cache, the cache manager will first search in GPU and CPU memory (if offloaded). If the needed cache has been evicted from both GPU and CPU, it will be recomputed.
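
If it helps, the same two knobs can also be set programmatically through the LLM API; a minimal sketch, assuming LLM and KvCacheConfig are importable as shown, that host_cache_size is given in bytes, and with a placeholder model name and placeholder sizes:

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Small on-GPU KV budget plus a host pool for offloaded blocks
# (placeholder sizes; tune for your GPU/CPU memory).
kv_cfg = KvCacheConfig(
    free_gpu_memory_fraction=0.5,
    host_cache_size=32_000_000_000,   # bytes of host memory for offloaded blocks
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model
    kv_cache_config=kv_cfg,
)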

SimengLiu-nv avatar Apr 02 '25 16:04 SimengLiu-nv

Due to the issue’s prolonged inactivity, I’m closing it. I hope the comments above have addressed the question. If the problem persists in the latest release, please open a new issue. Thanks!

karljang avatar Dec 17 '25 00:12 karljang