Executor API: How to get throughput
I am looking at the benchmarks/python/kv_cache_offload/benchmark.py example and trying to figure out how I can get throughput. This example gets TTFT from executor.get_latest_iteration_stats(). I looked at the get_latest_iteration_stats() definition but didn't find any information about how to get throughput.
Here is what is available via the get_latest_iteration_stats() function:
```
['cpu_mem_usage', 'cross_kv_cache_stats', 'gpu_mem_usage', 'inflight_batching_stats', 'iter', 'iter_latency_ms', 'kv_cache_stats', 'max_num_active_requests', 'new_active_requests_queue_latency_ms', 'num_active_requests', 'num_completed_requests', 'num_new_active_requests', 'num_queued_requests', 'pinned_mem_usage', 'static_batching_stats', 'timestamp', 'to_json_str']
```
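For reference, here is roughly how I'm pulling those stats. This is a minimal sketch: `executor` is created as in benchmark.py, and the attribute names are taken from the listing above.

```python
# Minimal sketch: iterate the latest per-iteration stats from the executor.
# `executor` is assumed to be set up as in benchmark.py.
for stats in executor.get_latest_iteration_stats():
    print(stats.iter, stats.iter_latency_ms, stats.num_completed_requests)
```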
Hi @khayamgondal
The throughput information should be stored in inflight_batching_stats.
Also, we are moving to trtllm-bench to consolidate the performance benchmarking process, which I would suggest you refer to.
For any specific question about this kv_cache_offload benchmark script, @SimengLiu-nv may help.
For any question about trtllm-bench, @FrankD412 @kaiyux @jiahanc can help.
Thanks,
June
Hi @khayamgondal, the end-to-end throughput statistics are calculated, not directly reported. For example, Token Throughput (tokens/sec) = total_output_tokens / total_latency, and Request Throughput (req/sec) = total_num_requests / total_latency. For the value of total_latency, you can convert the logged E2E TIME (ms) to seconds.
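As a concrete sketch, with made-up numbers (only the arithmetic matters):

```python
# Illustrative values only; substitute the totals you measure and the
# logged E2E TIME (ms) from your own run.
total_num_requests = 512
total_output_tokens = 65_536
e2e_time_ms = 20_000.0  # logged E2E TIME (ms)

total_latency_s = e2e_time_ms / 1000.0
token_throughput = total_output_tokens / total_latency_s    # tokens/sec
request_throughput = total_num_requests / total_latency_s   # req/sec
print(f"{token_throughput:.1f} tok/s, {request_throughput:.2f} req/s")
```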
Thanks @juney-nvidia, how do I use trtllm.KvCacheConfig with trtllm-bench?
@khayamgondal There is a --extra_llm_api_options argument provided by trtllm-bench that allows you to specify any custom configuration following the LlmArgs data structure, and that includes kv_cache_config.
Let me know for further questions, thanks.
@khayamgondal -- you should be able to specify a KV cache config via an extra options YML file. It would look something like:
```yaml
kv_cache_config:
  free_gpu_memory_fraction: 0.95
  enable_block_reuse: true
```
You can find the specific arguments to the KvCacheConfig [here](https://github.com/NVIDIA/TensorRT-LLM/blob/8bb3eea285db15c3b54c66230eb2701505fc863f/tensorrt_llm/llmapi/llm_args.py#L520). As @kaiyux mentioned, you can pass this in via --extra_llm_api_options (and it isn't limited to just the KV cache). A caveat is that if you modify an entry, none of the configuration set by trtllm-bench will hold, so make sure you set everything (even if trtllm-bench does set it).
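For example, you could generate that options file from Python and point trtllm-bench at it. This is just a sketch: the file name and values are arbitrary, PyYAML is assumed to be available, and the only thing taken from above is the --extra_llm_api_options flag itself.

```python
# Write an extra-options YAML and pass it to trtllm-bench via
# --extra_llm_api_options. Path and values are examples only.
import yaml

extra_options = {
    "kv_cache_config": {
        "free_gpu_memory_fraction": 0.95,
        "enable_block_reuse": True,
    }
}
with open("extra_llm_api_options.yaml", "w") as f:
    yaml.safe_dump(extra_options, f)

# Then run something like:
#   trtllm-bench throughput ... --extra_llm_api_options extra_llm_api_options.yaml
```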
Thanks @FrankD412, is there a way to dump the default values set by trtllm-bench?
Based on llm_args.py, it looks like I need the following:
```yaml
kv_cache_config:
  enable_block_reuse: true
  max_tokens: 17800
  max_attention_window: null
  sink_token_length: null
  free_gpu_memory_fraction: 0.9
  host_cache_size: null
  onboard_blocks: true
  cross_kv_cache_fraction: null
  secondary_offload_min_priority: null
  event_buffer_max_size: 0
  enable_partial_reuse: true
  copy_on_partial_reuse: true
```
@khayamgondal -- at the moment there isn't. There is some ongoing effort to make the configuration classes that the LLM API uses pure Python so that we can use Pydantic to serialize/deserialize a configuration YAML, but that's currently still a WIP.
For the KV configuration I think the only thing you'll really need to worry about is the free_gpu_memory_fraction, and if you don't set a kwarg it's just the default.
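If you just want to see the library-side defaults (not the values trtllm-bench itself sets on top of them), one rough way is to instantiate the config and inspect it. The import path below assumes the LLM API exports KvCacheConfig; the attribute names come from the YAML listing above.

```python
# Shows the KvCacheConfig defaults shipped with the library; note this does
# NOT reflect anything trtllm-bench overrides on top of these defaults.
from tensorrt_llm.llmapi import KvCacheConfig

cfg = KvCacheConfig()
print(cfg.free_gpu_memory_fraction, cfg.enable_block_reuse, cfg.host_cache_size)
```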
Thanks @FrankD412. My goal is to play with free_gpu_memory_fraction and host_cache_size to adjust how much KV cache resides on the GPU versus the CPU. I see I can make a change in TensorRT-LLM/tensorrt_llm/bench/benchmark/throughput.py to adjust free_gpu_memory_fraction, but I don't see any option for host_cache_size in that file. If I could set these two values in the throughput.py script, that would be easier for me than creating a YAML config file. The existing option in throughput.py looks like this:
```python
@optgroup.option(
    "--kv_cache_free_gpu_mem_fraction",
    type=float,
    default=.90,
    help="The percentage of memory to use for KV Cache after model load.",
)
```
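Something like the following is what I had in mind. To be clear, this flag does not exist today; it's just a sketch mirroring the option above.

```python
# Hypothetical option mirroring --kv_cache_free_gpu_mem_fraction above;
# not an existing trtllm-bench flag, just the shape of what I'd like.
@optgroup.option(
    "--kv_cache_host_cache_size",
    type=int,
    default=None,
    help="Size in bytes of host (CPU) memory for offloaded KV cache blocks.",
)
```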
Got it -- thanks for the feedback @khayamgondal
With the new configuration classes, there may be a way we could add a general CLI option to set properties that don't have a specific CLI option. I'll keep that in mind as we move towards it and see if there's a way to achieve it. We decided to try and limit the options on the CLI to make the benchmark easier to use and only expose the options we thought were the biggest knobs a user would want to tune. For now, the YAML file would be the way to handle it.
@khayamgondal You can try adding the host_cache_size option in https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/bench/dataclasses/configuration.py#L215-L220.
Ok @FrankD412, so with the YAML approach, do I have to provide all of the following configs even if I just want to change a few values in KvCacheConfig? From TensorRT-LLM/tensorrt_llm/llmapi/llm_args.py:
```python
field_mapping = {
    "quant_config": QuantConfig,
    "calib_config": CalibConfig,
    "build_config": BuildConfig,
    "kv_cache_config": KvCacheConfig,
    "decoding_config": DecodingConfig,
    "enable_build_cache": BuildCacheConfig,
    "peft_cache_config": PeftCacheConfig,
    "scheduler_config": SchedulerConfig,
    "speculative_config": DecodingBaseConfig,
    "batching_type": BatchingType,
    "extended_runtime_perf_knob_config": ExtendedRuntimePerfKnobConfig,
    "pytorch_backend_config": PyTorchConfig,
}
```
@khayamgondal -- you don't need to provide all of those, that's just a mapping so that the function knows what class to use to initialize each specific kwarg.
Edit:

> @khayamgondal You can try adding the host_cache_size option in https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/bench/dataclasses/configuration.py#L215-L220.

Just saw this. If you're working with the codebase and know what you want to set it to, then this can work.
@SimengLiu-nv quick architectural question: if I specify a lower value for free_gpu_memory_fraction than what is required for the KV cache, does that force the KV cache to be offloaded to the CPU, or does it cause the KV cache to be recomputed on the fly?
@khayamgondal You can think of on-GPU kv_cache memory as serving two main purposes:
- Per-iteration allocation: At the start of each iteration, enough GPU memory must be available to hold the kv_cache that will be generated during that step. If the free_gpu_memory_fraction is too low to meet this requirement, execution will fail. This effectively sets a hard lower bound.
- Accumulated cache management: Beyond the memory needed for the current iteration, the remaining GPU memory is used to store accumulated (i.e., past) kv_cache. If this accumulated cache exceeds the available space, the least recently used entries are offloaded to CPU memory—up to the size limit set by host_cache_size.
So, to directly answer your question:
- If free_gpu_memory_fraction is too low to support a single iteration’s kv_cache, the run will fail.
- If there's enough space for a single iteration, then offloading to CPU can occur when the total on-GPU kv_cache usage exceeds the available memory. When a new input requires previously generated cache, the cache manager will first search in GPU and CPU memory (if offloaded). If the needed cache has been evicted from both GPU and CPU, it will be recomputed.
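So in practice the two knobs discussed above are set together, for example like this (illustrative values only; the import path is an assumption, and the field names are the KvCacheConfig fields listed earlier in the thread):

```python
# Sketch: cap on-GPU KV cache at 50% of free GPU memory and allow up to
# 16 GiB of evicted blocks to be kept in CPU (host) memory.
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    free_gpu_memory_fraction=0.5,
    host_cache_size=16 * 1024**3,
)
```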
Due to the issue’s prolonged inactivity, I’m closing it. I hope the comments above have addressed the question. If the problem persists in the latest release, please open a new issue. Thanks!