
Force KV Cache Offload

khayamgondal opened this issue 9 months ago • 5 comments

Hello, I am trying to run the kv_cache_offload benchmark. My question: is there a way to force TRT-LLM to only use CPU for the KV cache? This example shows how to offload the KV cache to CPU when it overflows GPU memory. I am trying to run an experiment to check how much throughput/latency is affected if I keep the KV cache on GPU vs. offload it to CPU.

khayamgondal avatar Mar 27 '25 18:03 khayamgondal

Hi @khayamgondal

We did some performance studies of offloading the KV cache to CPU in the past, and the finding at that time was that there isn't a perf gain, so we only made CPU offloading an opt-in feature for KV cache reuse.

Can you elaborate a little bit more on why you want to only use CPU for the KV cache? Based on your workload (BS/ISL/OSL), it should not be very difficult to come up with some theoretical numbers for a qualitative estimate of the potential perf gain. We are also happy to learn about your scenarios.
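
For illustration, the kind of back-of-envelope estimate this refers to might look like the following; the model shape, sequence lengths, and link bandwidth are assumptions, not measured numbers:

    # Rough estimate of per-token KV cache size and CPU<->GPU transfer cost.
    # All numbers are illustrative (Llama-7B-like shape, FP16 cache).
    num_layers = 32
    num_kv_heads = 32
    head_dim = 128
    dtype_bytes = 2  # FP16

    # The KV cache stores one K and one V vector per layer per token.
    kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    print(f"KV cache per token: {kv_bytes_per_token / 1024:.1f} KiB")  # ~512 KiB

    # Time to move one request's cache over an assumed 64 GB/s offload link
    # (PCIe Gen5 x16 order of magnitude; GH200's NVLink-C2C is much faster).
    isl, osl = 2048, 256  # assumed input/output sequence lengths
    cache_bytes = (isl + osl) * kv_bytes_per_token
    link_bytes_per_s = 64e9
    print(f"Cache per request: {cache_bytes / 1e9:.2f} GB, "
          f"transfer time: {cache_bytes / link_bytes_per_s * 1e3:.1f} ms")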

Thanks June

@Kefeng-Duan who has done some related perf study before.

juney-nvidia avatar Mar 27 '25 23:03 juney-nvidia

Thanks, June. I'm working on a study to understand how much of a hit performance takes when part of the inference process (the KV cache in this scenario) is offloaded to the CPU. The experiment I want to run is an inference workload that fits entirely in the 96 GB of a GH200; then I want to run the same workload with some of the KV cache offloaded to the CPU and see how much that degrades performance.


khayamgondal avatar Mar 28 '25 03:03 khayamgondal


Got it. Since it is a company holiday now, let me ping the related colleagues next week to see whether there is something suitable to share.

Also, since the KV cache offloading logic in TensorRT-LLM is completely open source, feel free to hack on it yourself to study further :)

June

juney-nvidia avatar Mar 28 '25 05:03 juney-nvidia

Thanks @juney-nvidia. I am looking at the KvCacheConfig class and wondering: if I set the following to 0, would that force the KV cache off the GPU?

    free_gpu_memory_fraction: Optional[float] = Field(
        default=None,
        description=
        "The fraction of GPU memory fraction that should be allocated for the KV cache. Default is 90%. If both `max_tokens` and `free_gpu_memory_fraction` are specified, memory corresponding to the minimum will be used."
    )

khayamgondal avatar Mar 28 '25 15:03 khayamgondal


I am not very certain. If you are using this script, you can try setting free_gpu_memory_fraction to a very small value while leaving host_cache_size at its default value, and then observe the behavior.

The tricky thing is that when free_gpu_memory_fraction becomes very small, the batch scheduler may not be able to schedule new requests, but it is worth a try.
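
A minimal sketch of that suggestion, assuming the tensorrt_llm LLM API; the specific numbers are placeholders for experimentation, not recommended values:

    from tensorrt_llm.llmapi import LLM, KvCacheConfig

    # Keep only a tiny slice of GPU memory for the KV cache and give it a large
    # host (CPU) pool, then compare throughput/latency against the all-GPU run.
    kv_cache_config = KvCacheConfig(
        enable_block_reuse=True,
        free_gpu_memory_fraction=0.05,  # placeholder: a "very small" GPU share
        host_cache_size=32 * 1024**3,   # placeholder: 32 GiB of host memory
    )

    llm = LLM(model="/path/to/model", kv_cache_config=kv_cache_config)

As noted above, if the GPU share gets too small the scheduler may fail to schedule requests, so this is an experiment knob rather than a supported "CPU only" mode.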

Thanks June

juney-nvidia avatar Mar 29 '25 09:03 juney-nvidia

@Kefeng-Duan do you have some insights on forcing kv cache offload to CPU ?

khayamgondal avatar Mar 31 '25 18:03 khayamgondal

Hi @khayamgondal, if you are using the Python bindings you need to import the executor bindings and pass the following parameters (through Pybind11) into trtllm's C++ type system:

Executor and configuration types:

import tensorrt_llm.bindings.executor as trtllm
from tensorrt_llm.bindings.executor import KvCacheTransferMode
from tensorrt_llm.llmapi import (
    LLM,
    BuildConfig,
    KvCacheConfig,
    SchedulerConfig,
    CapacitySchedulerPolicy,
)

Configuration:

1. KvCacheConfig

    kv_cache_config = KvCacheConfig(
        enable_block_reuse=True,
        host_cache_size=HOST_CACHE_BYTES,
        max_tokens=MAX_SEQ_LEN,
        secondary_offload_min_priority=kv_range_priority,
    )

2. CapacitySchedulerPolicy

    scheduler_config = SchedulerConfig(
        capacity_scheduler_policy=CapacitySchedulerPolicy.MAX_UTILIZATION
    )

3. KvCacheRetentionConfig

    retention_cfg = trtllm.KvCacheRetentionConfig(
        [
            trtllm.KvCacheRetentionConfig.TokenRangeRetentionConfig(
                token_start=0,
                token_end=None,
                priority=args.kv_range_priority,
                duration_ms=datetime.timedelta(seconds=args.kv_retention_secs),
            )
        ],
        decode_retention_priority=args.kv_decode_priority,
        transfer_mode=mode,
        directory=args.kv_transfer_dir,
    )

Here are some example code pieces for you to work into a script that uses the tensorrt_llm Python bindings:

import argparse
import datetime

import tensorrt_llm.bindings.executor as trtllm
from tensorrt_llm.bindings.executor import KvCacheTransferMode
from tensorrt_llm.llmapi import (
    LLM,
    BuildConfig,
    KvCacheConfig,
    SchedulerConfig,
    CapacitySchedulerPolicy,
)

# Placeholder constants used below; adjust to your model and memory budget.
MAX_SEQ_LEN = 8192
HOST_CACHE_BYTES = 32 * 1024**3  # 32 GiB of host (CPU) memory for offloaded blocks

def make_retention_cfg(args: argparse.Namespace) -> trtllm.KvCacheRetentionConfig:
    mode = {
        "DRAM":  KvCacheTransferMode.DRAM,
        "GDS":   KvCacheTransferMode.GDS,
        "POSIX": KvCacheTransferMode.POSIX_DEBUG_FALLBACK,
    }[args.kv_transfer_mode]

    return trtllm.KvCacheRetentionConfig(
        [
            trtllm.KvCacheRetentionConfig.TokenRangeRetentionConfig(
                token_start=0,
                token_end=None,
                priority=args.kv_range_priority,
                duration_ms=datetime.timedelta(seconds=args.kv_retention_secs),
            )
        ],
        decode_retention_priority=args.kv_decode_priority,
        transfer_mode=mode,
        directory=args.kv_transfer_dir,
    )

def build_llm(args: argparse.Namespace) -> LLM:
    """Instantiate TensorRT‑LLM `LLM`."""
    llm = LLM(
        model=args.safetensor_dir,
        tokenizer=args.tokenizer_dir,
        dtype=args.dtype,
        build_config=BuildConfig(
            max_beam_width=1,
            max_batch_size=8,
            max_num_tokens=MAX_SEQ_LEN,
            max_seq_len=MAX_SEQ_LEN,
        ),
        kv_cache_config=KvCacheConfig(
            enable_block_reuse=True,
            host_cache_size=HOST_CACHE_BYTES,
            max_tokens=MAX_SEQ_LEN,
            secondary_offload_min_priority=args.kv_range_priority,
        ),
        scheduler_config=SchedulerConfig(
            capacity_scheduler_policy=CapacitySchedulerPolicy.MAX_UTILIZATION
        ),
        tensor_parallel_size=args.tensor_parallel_size,
        pipeline_parallel_size=args.pipeline_parallel_size,
    )
    return llm

# `args`, `ids` (the tokenized prompt), and `req` (per-request parameters)
# come from the surrounding serving script.
llm = build_llm(args)
retention_cfg = make_retention_cfg(args)
gen = llm.generate_async(
    inputs=ids,
    max_new_tokens=req.max_tokens,
    temperature=req.temperature,
    top_p=req.top_p,
    kv_cache_retention_config=retention_cfg,
    streaming=req.stream,
)
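
For completeness, here is a sketch of the argparse namespace the snippet above assumes; the flag names simply mirror the `args.*` attributes accessed in the code, and the defaults are placeholders:

    import argparse

    def parse_args() -> argparse.Namespace:
        p = argparse.ArgumentParser()
        p.add_argument("--safetensor_dir", required=True)
        p.add_argument("--tokenizer_dir", required=True)
        p.add_argument("--dtype", default="auto")
        p.add_argument("--tensor_parallel_size", type=int, default=1)
        p.add_argument("--pipeline_parallel_size", type=int, default=1)
        p.add_argument("--kv_transfer_mode", choices=["DRAM", "GDS", "POSIX"], default="DRAM")
        p.add_argument("--kv_transfer_dir", default=None)
        p.add_argument("--kv_range_priority", type=int, default=100)
        p.add_argument("--kv_decode_priority", type=int, default=100)
        p.add_argument("--kv_retention_secs", type=int, default=600)
        return p.parse_args()

    args = parse_args()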

arthurrasmusson avatar Jul 18 '25 21:07 arthurrasmusson

@juney-nvidia @khayamgondal How can I offload all of the KV cache to CPU?

trtllm-serve nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1 --tp_size 1 \
    --host 0.0.0.0 \
    --max_batch_size 2 \
    --max_seq_len 32768 \
    --extra_llm_api_options nemo-serve.yaml

with nemo-serve.yaml:

    kv_cache_config:
      host_cache_size: 10000000000
      free_gpu_memory_fraction: 0.50

This does not work: trtllm-serve still tries to fill the GPU memory and only reduces max_seq_len.

rnik12 avatar Aug 26 '25 18:08 rnik12