
feat:enable kvcache to be reused during request generation

Open narutolhy opened this issue 8 months ago • 23 comments

Issue: https://github.com/NVIDIA/TensorRT-LLM/issues/3733

[issues/3733][feat] enable kvcache to be reused during request generation

Description

This PR enhances the KV cache reuse logic in TensorRT-LLM by enabling block reuse during the generation phase, not just after request completion. Specifically:

  • Introduced a new interface storeBlocksForReuse() in KVCacheManager, allowing selective caching of KV blocks while generation is still ongoing (a rough interface sketch follows this list).
  • This improves memory reuse for partially generated requests, reducing overall KV block allocation pressure in long-running or chunked scenarios.
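
A minimal sketch of the interface shape this refers to, using simplified stand-in types; only the storeBlocksForReuse() name and its LlmRequest parameter come from the PR description itself:

```cpp
// Simplified stand-in types; only the storeBlocksForReuse() name and its
// LlmRequest parameter come from the PR description itself.
#include <cstdint>
#include <vector>

struct LlmRequest // stand-in: prompt tokens followed by the tokens generated so far
{
    std::vector<std::int32_t> tokens;
};

class KVCacheManager
{
public:
    // New in this PR: register already-filled KV blocks of an in-flight request
    // for reuse, instead of waiting until the request has fully completed.
    void storeBlocksForReuse(LlmRequest const& req);
};
```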

Changes

  • Added KVCacheManager::storeBlocksForReuse(LlmRequest const&) to allow early KV block reuse.
  • Integrated block reuse logic into the generation flow, enabling reuse before a request fully completes (see the sketch after this list).
  • Extended the behavior verified by the existing unit test KVCacheReuseChunked:
    • Previously, reuse was only validated after full generation completion.
    • Now it also verifies reuse after partial generation phases.
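
A simplified sketch of where the call could sit in the generation flow, reusing the stand-in types from the sketch above; decodeStep() and isFinished() are hypothetical placeholders, not the real generation code:

```cpp
// decodeStep() and isFinished() are hypothetical placeholders for the real
// generation flow; the point is only where storeBlocksForReuse() gets called.
void decodeStep(LlmRequest& req);       // hypothetical: produce the next token
bool isFinished(LlmRequest const& req); // hypothetical: stop criterion reached

void generationLoop(KVCacheManager& kvCacheManager, LlmRequest& req)
{
    while (!isFinished(req))
    {
        decodeStep(req); // fills KV cache blocks as new tokens are produced

        // With this PR, blocks that are already full become reusable here,
        // before the request reaches completion.
        kvCacheManager.storeBlocksForReuse(req);
    }
}
```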

Test Coverage

  • The existing unit test KVCacheReuseChunked has been extended to validate reuse during generation.
  • Block reuse counters are now checked incrementally during the decoding process (illustrated in the sketch below).
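
A rough sketch of what the incremental check could look like, using hypothetical fakes in place of the real test fixture and KVCacheManager statistics:

```cpp
// Hypothetical fakes stand in for the real test fixture and KVCacheManager
// statistics; the block only illustrates the shape of the incremental check.
#include <cstddef>
#include <gtest/gtest.h>

struct FakeKVCacheManager
{
    std::size_t reusedBlocks = 0; // stand-in for the manager's reuse statistics
};

struct FakeRequest
{
};

constexpr int kNumChunks = 4; // hypothetical chunk count for the test input

std::size_t getReusedBlockCount(FakeKVCacheManager const& manager)
{
    return manager.reusedBlocks;
}

void runChunk(FakeKVCacheManager& manager, FakeRequest&, int /*chunkIndex*/)
{
    ++manager.reusedBlocks; // pretend every chunk reuses one more block
}

TEST(KVCacheReuseChunked, ReuseDuringGeneration)
{
    FakeKVCacheManager kvCacheManager;
    FakeRequest request;

    std::size_t previousReused = getReusedBlockCount(kvCacheManager);
    for (int chunk = 0; chunk < kNumChunks; ++chunk)
    {
        runChunk(kvCacheManager, request, chunk);

        // With this PR, reuse is expected to grow while generation is still in
        // progress, not only after the request has fully completed.
        std::size_t const reused = getReusedBlockCount(kvCacheManager);
        EXPECT_GE(reused, previousReused);
        previousReused = reused;
    }
}
```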

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".
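
For instance, a run that disables fail-fast and limits tests to particular GPU types could be requested with a PR comment like the following (flag values reuse the examples listed above):

/bot run --disable-fail-fast --gpu-type "A30, H100_PCIe"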

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of care and validation can break the top of tree.

narutolhy avatar May 02 '25 10:05 narutolhy

@WeiHaocheng @dc3671 @thorjohnsen Hi Fred/Zhenhuan/Thor

Can you help review this PR from the community?

Thanks June

juney-nvidia avatar May 02 '25 10:05 juney-nvidia

Sure~Let me review it~

WeiHaocheng avatar May 07 '25 01:05 WeiHaocheng

It looks like we can store a KV cache block only when a new block is generated, and only store that new block. Let me talk with @narutolhy offline.

WeiHaocheng avatar May 08 '25 03:05 WeiHaocheng

/bot run

WeiHaocheng avatar May 12 '25 13:05 WeiHaocheng

@thorjohnsen Hi Thor~ Could you help to review this PR?

WeiHaocheng avatar May 13 '25 03:05 WeiHaocheng

/bot run

WeiHaocheng avatar May 15 '25 08:05 WeiHaocheng

PR_Github #5323 [ run ] triggered by Bot

tensorrt-cicd avatar May 15 '25 09:05 tensorrt-cicd

PR_Github #5323 [ run ] completed with state FAILURE /LLM/main/L0_MergeRequest_PR pipeline #3882 completed with status: 'FAILURE'

tensorrt-cicd avatar May 15 '25 10:05 tensorrt-cicd

/bot run

WeiHaocheng avatar May 15 '25 10:05 WeiHaocheng

PR_Github #5340 [ run ] triggered by Bot

tensorrt-cicd avatar May 15 '25 10:05 tensorrt-cicd

PR_Github #5340 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #3893 completed with status: 'FAILURE'

tensorrt-cicd avatar May 15 '25 14:05 tensorrt-cicd

Note that this feature depends on part of the partial matching feature for correct execution (specifically, the logic that copies KV state instead of sharing a block when a partially filled reusable block is already owned by another request). Partial matching can be disabled with a configuration option, but this code would still execute correctly with partial matching disabled, because the above-mentioned logic cannot be turned off (you would just see less reuse).

thorjohnsen avatar May 16 '25 18:05 thorjohnsen
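
A minimal sketch of the copy-vs-share decision described above; all names here are hypothetical stand-ins, not the actual TensorRT-LLM code:

```cpp
// All names here are hypothetical stand-ins, not the actual TensorRT-LLM code;
// the sketch only captures the copy-vs-share decision described above.
struct Block
{
    bool partiallyFilled;     // block still has room for more tokens
    bool ownedByOtherRequest; // another in-flight request may keep appending to it
};

enum class ReuseAction
{
    ShareBlock,
    CopyKvState
};

ReuseAction chooseReuseAction(Block const& matchedBlock)
{
    if (matchedBlock.partiallyFilled && matchedBlock.ownedByOtherRequest)
    {
        // Sharing would let two requests write into the same partial block, so
        // the KV state is copied instead. Because this path cannot be disabled,
        // the feature stays correct even with partial matching turned off,
        // at the cost of less reuse.
        return ReuseAction::CopyKvState;
    }
    return ReuseAction::ShareBlock;
}
```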

My one concern is that this might introduce significant CPU overhead. It looks like the last block of each generation request is stored in every iteration, so each block gets stored blockSize times. If it turns out to be an issue there are easy ways to mitigate this, so I am still approving.

Thanks for @thorjohnsen's review~ [screenshot of the relevant code omitted] It looks like there is a check here, so each block only needs to be stored once. Let's merge it and observe whether it introduces an obvious perf issue.

WeiHaocheng avatar May 17 '25 01:05 WeiHaocheng
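
A hypothetical sketch of the kind of guard being discussed: storeBlocksForReuse() may be called every decoding iteration, but a block is only registered once, the first time it is seen full. The names and the exact condition are assumptions; the real check in the PR may differ in detail:

```cpp
// Hypothetical guard: a block is registered for reuse only the first time it is
// seen full, so the per-iteration call does not re-store it blockSize times.
#include <cstddef>
#include <unordered_set>

struct BlockState
{
    std::size_t blockId;
    std::size_t filledTokens;
    std::size_t tokensPerBlock;
};

void maybeStoreBlock(BlockState const& block, std::unordered_set<std::size_t>& storedBlockIds)
{
    // Skip blocks that are not yet full or were already stored.
    if (block.filledTokens < block.tokensPerBlock || storedBlockIds.count(block.blockId) != 0)
    {
        return;
    }
    storedBlockIds.insert(block.blockId);
    // ... hand the block to the reuse pool here ...
}
```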

/bot reuse-pipeline

WeiHaocheng avatar May 17 '25 01:05 WeiHaocheng

PR_Github #5549 [ reuse-pipeline ] triggered by Bot

tensorrt-cicd avatar May 17 '25 01:05 tensorrt-cicd

PR_Github #5549 [ reuse-pipeline ] completed with state SUCCESS Can't reuse PR_Github #5340 with status: FAILED

tensorrt-cicd avatar May 17 '25 01:05 tensorrt-cicd

/bot run

WeiHaocheng avatar May 17 '25 02:05 WeiHaocheng

PR_Github #5552 [ run ] triggered by Bot

tensorrt-cicd avatar May 17 '25 02:05 tensorrt-cicd

PR_Github #5552 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #4051 completed with status: 'FAILURE'

tensorrt-cicd avatar May 17 '25 04:05 tensorrt-cicd

/bot run

WeiHaocheng avatar May 17 '25 08:05 WeiHaocheng

/bot run

WeiHaocheng avatar May 18 '25 00:05 WeiHaocheng

PR_Github #5581 [ run ] triggered by Bot

tensorrt-cicd avatar May 18 '25 01:05 tensorrt-cicd

PR_Github #5581 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #4069 completed with status: 'FAILURE'

tensorrt-cicd avatar May 18 '25 03:05 tensorrt-cicd

/bot run

WeiHaocheng avatar May 19 '25 01:05 WeiHaocheng

PR_Github #5650 [ run ] triggered by Bot

tensorrt-cicd avatar May 19 '25 01:05 tensorrt-cicd

PR_Github #5650 [ run ] completed with state FAILURE /LLM/main/L0_MergeRequest_PR pipeline #4127 completed with status: 'FAILURE'

tensorrt-cicd avatar May 19 '25 03:05 tensorrt-cicd

/bot run

WeiHaocheng avatar May 19 '25 10:05 WeiHaocheng

PR_Github #5717 [ run ] triggered by Bot

tensorrt-cicd avatar May 19 '25 10:05 tensorrt-cicd

PR_Github #5717 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #4178 completed with status: 'FAILURE'

tensorrt-cicd avatar May 19 '25 13:05 tensorrt-cicd

/bot run

WeiHaocheng avatar May 19 '25 13:05 WeiHaocheng