feat: enable kvcache to be reused during request generation
Issue: [https://github.com/NVIDIA/TensorRT-LLM/issues/3733]
[issues/3733][feat] enable kvcache to be reused during request generation
Description
This PR enhances the KV cache reuse logic in TensorRT-LLM by enabling block reuse during the generation phase, not just after request completion. Specifically:
- Introduced a new interface, `storeBlocksForReuse()`, in `KVCacheManager`, allowing selective caching of KV blocks while generation is still ongoing.
- This improves memory reuse for partially generated requests, reducing overall KV block allocation pressure in long-running or chunked scenarios.
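The idea of storing blocks while a request is still generating can be sketched with a toy model. This is only an illustration under stated assumptions, not the TensorRT-LLM implementation: `ToyReuseStore`, `countReusableBlocks`, and the prefix-keyed map are hypothetical, and the toy `storeBlocksForReuse` takes a token vector rather than an `LlmRequest`.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical toy model; these are not the TensorRT-LLM data structures.
using Tokens = std::vector<int>;

struct ToyReuseStore
{
    std::size_t blockSize;
    // Maps the token prefix ending at a full block boundary to a pretend block id.
    std::map<Tokens, int> stored;
    int nextBlockId = 0;

    explicit ToyReuseStore(std::size_t bs)
        : blockSize(bs)
    {
        assert(blockSize > 0);
    }

    // Store every full block of `tokens` that is not stored yet.
    // Safe to call repeatedly while the sequence keeps growing.
    // Returns the number of newly stored blocks.
    int storeBlocksForReuse(Tokens const& tokens)
    {
        int added = 0;
        for (std::size_t end = blockSize; end <= tokens.size(); end += blockSize)
        {
            Tokens prefix(tokens.begin(), tokens.begin() + static_cast<std::ptrdiff_t>(end));
            if (stored.emplace(prefix, nextBlockId).second)
            {
                ++nextBlockId;
                ++added;
            }
        }
        return added;
    }

    // Count how many leading full blocks of `tokens` are already stored.
    std::size_t countReusableBlocks(Tokens const& tokens) const
    {
        std::size_t reusable = 0;
        for (std::size_t end = blockSize; end <= tokens.size(); end += blockSize)
        {
            Tokens prefix(tokens.begin(), tokens.begin() + static_cast<std::ptrdiff_t>(end));
            if (stored.find(prefix) == stored.end())
            {
                break;
            }
            ++reusable;
        }
        return reusable;
    }
};
```

In this sketch, calling `storeBlocksForReuse` after each decoding step makes a block visible to other requests as soon as it fills up, instead of only when the owning request finishes.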
Changes
- Added `KVCacheManager::storeBlocksForReuse(LlmRequest const&)` to allow early KV block reuse.
- Integrated block reuse logic into the generation flow, enabling reuse before full completion of a request.
- Extended the behavior verified by the existing unit test `KVCacheReuseChunked`:
  - Previously, reuse was only validated after full generation completion.
  - Now it also verifies reuse after partial generation phases.
Test Coverage
- The existing unit test `KVCacheReuseChunked` has been extended to validate reuse during generation.
- Block reuse counters are now checked incrementally during the decoding process.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message.
See details below for each supported subcommand.
run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]
Launch build/test pipelines. All previously running jobs will be killed.
--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
@WeiHaocheng @dc3671 @thorjohnsen Hi Fred/Zhenhuan/Thor
Can you help review this PR from the community?
Thanks June
Sure~Let me review it~
Looks like we can store the kv cache block only when the new block is generated and only store the new block. Let me talk with @narutolhy offline.
/bot run
@thorjohnsen Hi Thor~ Could you help to review this PR?
/bot run
PR_Github #5323 [ run ] triggered by Bot
PR_Github #5323 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #3882 completed with status: 'FAILURE'
/bot run
PR_Github #5340 [ run ] triggered by Bot
PR_Github #5340 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #3893 completed with status: 'FAILURE'
Note that this feature needs part of the partial matching feature for correct execution (specifically, the logic that copies KV state instead of sharing the block when a partially filled reusable block is already owned by another request). Partial matching can be disabled with a configuration option, but this code would still execute correctly with partial matching disabled, because the above-mentioned logic cannot be disabled (you would see less reuse, though).
My one concern is that this might introduce significant CPU overhead. It looks like the last block of each generation request is stored in every iteration, so each block gets stored blockSize times. If it turns out to be an issue, there are easy ways to mitigate this, so I am still approving.
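The copy-instead-of-share rule mentioned in the note on partial matching can be sketched as follows. This is a minimal illustration, not the actual TensorRT-LLM logic: `ToyBlock` and `acquireMatchedBlock` are hypothetical names, and the rationale in the comments (the owner may still append to a partial block) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a KV cache block.
struct ToyBlock
{
    std::vector<int> tokens; // tokens whose KV state the block holds
    std::size_t capacity;    // tokens per block
};

// Decide how a new request obtains a block that matched its prefix.
// Assumption: a full block no longer changes, so it can be shared, while a
// partially filled block owned by another request may still be appended to,
// so the new request receives its own copy of the KV state.
std::shared_ptr<ToyBlock> acquireMatchedBlock(
    std::shared_ptr<ToyBlock> const& matched, bool ownedByOtherRequest)
{
    assert(matched != nullptr);
    bool const isPartial = matched->tokens.size() < matched->capacity;
    if (isPartial && ownedByOtherRequest)
    {
        return std::make_shared<ToyBlock>(*matched); // copy the state
    }
    return matched; // share the block
}
```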
Thanks for the review, @thorjohnsen~
It looks like there is a check here so that each block only needs to be stored once. Let's merge it and see whether it introduces any obvious perf issues.
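To make the overhead discussion concrete, here is a small simulation that counts store calls under two policies. It is a sketch under assumptions: `countStoreCalls` and `StoreCounts` are hypothetical, and the guard used here (store only when the sequence length reaches a block boundary) is one possible check, not necessarily the one in this PR.

```cpp
#include <cassert>
#include <cstddef>

struct StoreCounts
{
    std::size_t naive;   // store the last block in every iteration
    std::size_t guarded; // store only when a block has just become full
};

// Simulate `numGenerated` decoding steps for one request.
StoreCounts countStoreCalls(std::size_t promptLen, std::size_t numGenerated, std::size_t blockSize)
{
    assert(blockSize > 0);
    StoreCounts counts{0, 0};
    std::size_t seqLen = promptLen;
    for (std::size_t step = 0; step < numGenerated; ++step)
    {
        ++seqLen;       // one new token per iteration
        ++counts.naive; // naive policy touches the last block every time
        if (seqLen % blockSize == 0)
        {
            ++counts.guarded; // guarded policy stores each block once
        }
    }
    return counts;
}
```

For example, with an empty prompt, 8 generated tokens, and a block size of 4, the naive policy performs 8 store calls while the guarded one performs 2.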
/bot reuse-pipeline
PR_Github #5549 [ reuse-pipeline ] triggered by Bot
PR_Github #5549 [ reuse-pipeline ] completed with state SUCCESS
Can't reuse PR_Github #5340 with status: FAILED
/bot run
PR_Github #5552 [ run ] triggered by Bot
PR_Github #5552 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4051 completed with status: 'FAILURE'
/bot run
/bot run
PR_Github #5581 [ run ] triggered by Bot
PR_Github #5581 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4069 completed with status: 'FAILURE'
/bot run
PR_Github #5650 [ run ] triggered by Bot
PR_Github #5650 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #4127 completed with status: 'FAILURE'
/bot run
PR_Github #5717 [ run ] triggered by Bot
PR_Github #5717 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4178 completed with status: 'FAILURE'
/bot run