[https://nvbugs/5322131][feat] Multi-LoRA serving with CUDA Graph
NvBug
https://nvbugspro.nvidia.com/bug/5322131 https://nvbugspro.nvidia.com/bug/5441746
Benchmark Perf
merge base 43c46a09
| Configuration | Avg ITL (ms) | Throughput (tok/s) | Avg TTFT (ms) |
|---|---|---|---|
| CUDA Graph + multi LoRA (after changes) | 26 | 568 | 227 |
| no CUDA Graph + multi LoRA (before changes) | 119 | 127 | 435 |
| CUDA Graph + no LoRA (before changes) | 14 | 1068 | 137 |

Llama 3.3 70B, TP8; p5.48xlarge (8x H100); ISL: 1600; OSL: 600; Concurrency: 16; all requests query the same LoRA adapter.
Still need to remove some logging/testing code.
Potential future optimizations not included in this PR:
- Move the fused prefill + decode batch onto the new LoRA path, which might reduce bubbles in all-reduce
- Update the sm80 (split-K) grouped GEMMs currently used
Summary by CodeRabbit

- New Features
  - Added CUDA Graph mode for LoRA with multi-adapter batching and slot management.
  - Introduced fused parameter preparation and row reordering to reduce kernel launches.
  - Exposed a device-side cache check for tasks in Python.
  - Enabled optional NVTX profiling wrappers for easier performance tracing (a usage sketch follows this list).
- Performance
  - Implemented CUDA Graph–compatible grouped and split-K GEMM paths for faster LoRA execution.
  - Reduced per-step overhead via persistent buffers and slot reuse.
- Tests
  - Expanded test coverage to run LoRA scenarios with and without CUDA Graph, including edge cases.
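As a usage illustration for the NVTX profiling wrappers mentioned above, the sketch below shows a minimal decorator built on torch.cuda.nvtx ranges. The decorator name nvtx_range and the decorated function are hypothetical; the actual nvtx_pytorch_emit helper added in tensorrt_llm/_utils.py may differ in interface and behavior.

```python
import functools

import torch


def nvtx_range(name: str):
    """Hypothetical decorator factory: wrap a function call in a named NVTX
    range so it shows up in Nsight Systems timelines. A sketch only; the real
    nvtx_pytorch_emit helper in tensorrt_llm/_utils.py may behave differently."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            torch.cuda.nvtx.range_push(name)  # open the named range
            try:
                return fn(*args, **kwargs)
            finally:
                torch.cuda.nvtx.range_pop()  # always close it, even on error
        return wrapper
    return decorator


@nvtx_range("lora_cuda_graph_prepare")
def prepare_step():
    # placeholder for the work being profiled
    pass
```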
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
- PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
- PR follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
- Test cases are provided for new code paths (see test instructions)
- Any new dependencies have been scanned for license and vulnerabilities
- CODEOWNERS updated if ownership changes
- Documentation updated as needed
- The reviewers assigned automatically/manually are appropriate for the PR.
- [ ] Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message.
See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
📝 Walkthrough
Adds CUDA Graph-based multi-LoRA execution: new grouped GEMM kernels and a fused param-fill/reorder kernel; Torch bindings and THOP entry points; Python-side CUDA Graph LoRA params, slot and manager classes; engine integration with optional CUDA Graph path; PEFT cache/device-lookup extensions; tests and NVTX profiling hooks. Also adjusts attention and miscellaneous bindings.
Changes
| Cohort / File(s) | Summary |
|---|---|
| PEFT cache API updates: cpp/include/tensorrt_llm/batch_manager/peftCacheManager.h, cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp | Add PeftCacheManager::isTaskCachedDevice; ensureBatch maps taskId to device-resolved LoRA config. |
| Bindings for PEFT cache: cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp, cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp | Expose is_task_cached and new is_task_cached_device to Python with GIL release (see the usage sketch after this table). |
| CUDA Graph grouped GEMM kernels: cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.h, cpp/tensorrt_llm/kernels/cuda_graph_grouped_gemm.cu | New CUDA Graph-compatible grouped GEMM and split-K grouped GEMM implementations and declarations. |
| LoRA fused prep kernel: cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.h, cpp/tensorrt_llm/kernels/lora/loraGroupGEMMParamFillRowReorderFusion.cu | New fused kernel to fill params, row-reorder, and zero-fill, with launcher API. |
| LoRA kernel comment: cpp/tensorrt_llm/kernels/lora/lora.cpp | Add clarifying comment on GemmCoord usage. |
| THOP LoRA ops: cpp/tensorrt_llm/thop/loraOp.cpp | Add CUDA Graph grouped GEMM path and fused param-fill/reorder entry; register Torch ops. |
| Attention module tweak: tensorrt_llm/_torch/modules/attention.py | Wrap o_lora init in string literal; o_lora assignment skipped. |
| CUDA Graph LoRA manager and params: tensorrt_llm/_torch/peft/lora/adapter_slot_manager.py, .../cuda_graph_lora_manager.py, .../cuda_graph_lora_params.py | Add AdapterSlotManager (LRU slots), CudaGraphLoraManager (prep flow), and CudaGraphLoraParams (persistent CUDA Graph buffers, pointers, sizes). |
| LoRA layer integration: tensorrt_llm/_torch/peft/lora/layer.py | Add CUDA Graph mode in forward, buffer prep helpers, dataclasses for grouped GEMM params, tensorized size metadata. |
| Engine integration: tensorrt_llm/_torch/pyexecutor/model_engine.py, .../_util.py, .../py_executor.py, tensorrt_llm/executor/worker.py | Initialize CUDA Graph LoRA manager, propagate maybe_graph, add NVTX emit decorator and tracing wrapper; minor prints. |
| Resource/PEFT plumbing: tensorrt_llm/_torch/pyexecutor/resource_manager.py | Add batch PEFT table getters/reset; expose is_task_cached_device; track batch PEFT state. |
| NVTX utility: tensorrt_llm/_utils.py | Add nvtx_pytorch_emit decorator factory. |
| Tests: CUDA Graph LoRA: tests/unittest/llmapi/lora_test_utils.py, tests/unittest/llmapi/test_llm_pytorch.py | Add CUDA Graph LoRA test params and helpers; parametrize many tests with cuda_graph_config; add kernel special-case tests. |
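The bindings row above points here: a minimal usage sketch for the new device-side cache check. Only the method names is_task_cached and is_task_cached_device come from this PR; the helper function, its arguments, and the call site below are assumptions for illustration.

```python
from typing import Iterable


def all_tasks_resident_on_device(peft_cache_manager, lora_task_ids: Iterable[int]) -> bool:
    """Hypothetical check: CUDA Graph replay reuses recorded device-side weight
    pointers, so every LoRA task in the batch should already be resident in the
    device cache (is_task_cached_device), not merely known to the host-side
    cache (is_task_cached)."""
    return all(
        peft_cache_manager.is_task_cached_device(task_id)
        for task_id in lora_task_ids
    )
```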
Sequence Diagram(s)
```mermaid
sequenceDiagram
    autonumber
    participant Engine as PyTorchModelEngine
    participant LoraMgr as CudaGraphLoraManager
    participant SlotMgr as AdapterSlotManager
    participant Peft as PeftCacheManager
    participant Params as CudaGraphLoraParams
    participant THOP as thop lora ops
    participant Kern as CUDA Graph GEMM Kernels
    Engine->>LoraMgr: prepare_cuda_graph_lora_params(scheduled_requests, attn_metadata, peft_cache_manager)
    LoraMgr->>Peft: get_and_reset_batch_peft_table()
    LoraMgr->>SlotMgr: update_slots(requests, peft_cache_manager)
    SlotMgr-->>LoraMgr: batch_slot_ids, slots_changed
    LoraMgr->>Params: update_sorted_indices(batch_slot_ids)
    alt slots_changed
        LoraMgr->>Params: update_weight_pointers(peft_table)
        LoraMgr->>SlotMgr: reset_changed_flag()
    end
    LoraMgr->>Params: update_slots_params(batch_slot_ids)
    LoraMgr-->>Engine: {cuda_graph_params, use_cuda_graph_mode, ...}
    Engine->>THOP: lora_group_gemm_param_fill_row_reorder_fusion(...)
    THOP->>Kern: launchLoraGroupGEMMParamFillRowReorderFusion(...)
    Engine->>THOP: lora_grouped_gemm_cuda_graph(... in/out ...)
    THOP->>Kern: cuda_graph_grouped_gemm / splitk_grouped_gemm(...)
```
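The update_slots step above maps each request's LoRA adapter to a fixed slot index so that the captured graph can keep indexing the same persistent buffers. Below is a minimal LRU slot-assignment sketch: the class and method names (AdapterSlotManager, update_slots, reset_changed_flag) and the returned batch_slot_ids / slots_changed come from the diagram, while the internal logic and the fixed num_slots parameter are assumptions, not the PR's implementation.

```python
from collections import OrderedDict
from typing import List, Tuple


class AdapterSlotManager:
    """Sketch: assign LoRA task ids to a fixed set of adapter slots,
    evicting the least recently used adapter when all slots are taken."""

    def __init__(self, num_slots: int):
        self.num_slots = num_slots
        self.slot_of_task: "OrderedDict[int, int]" = OrderedDict()  # task_id -> slot, in LRU order
        self.free_slots = list(range(num_slots))
        self.slots_changed = False  # set when weight pointers must be refreshed

    def update_slots(self, task_ids: List[int]) -> Tuple[List[int], bool]:
        batch_slot_ids = []
        for task_id in task_ids:
            if task_id in self.slot_of_task:
                self.slot_of_task.move_to_end(task_id)  # refresh LRU order
            else:
                if self.free_slots:
                    slot = self.free_slots.pop()
                else:
                    _, slot = self.slot_of_task.popitem(last=False)  # evict LRU adapter
                self.slot_of_task[task_id] = slot
                self.slots_changed = True
            batch_slot_ids.append(self.slot_of_task[task_id])
        return batch_slot_ids, self.slots_changed

    def reset_changed_flag(self) -> None:
        self.slots_changed = False
```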
```mermaid
sequenceDiagram
    autonumber
    participant Layer as LoraLayer.forward
    participant CG as CUDA-Graph path
    participant Legacy as Legacy path
    Layer->>Layer: decide mode (cuda_graph_enabled && params available)
    alt CUDA Graph mode
        Layer->>CG: _forward_cuda_graph_mode(...)
        CG->>CG: prepare_grouped_gemm_buffers / fused prep
        CG-->>Layer: output tensor or None
    else Legacy mode
        Layer->>Legacy: _forward_legacy_mode(...)
        Legacy-->>Layer: output tensor or None
    end
```
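To make the mode decision in the second diagram concrete, here is a minimal dispatch sketch. The method names _forward_cuda_graph_mode and _forward_legacy_mode come from the diagram; the dictionary keys and the exact enable condition (for example, how mixed prefill-decode batches are excluded) are assumptions.

```python
from typing import Callable, Optional

import torch


def lora_forward_dispatch(
    x: torch.Tensor,
    lora_params: Optional[dict],
    forward_cuda_graph_mode: Callable[[torch.Tensor, dict], Optional[torch.Tensor]],
    forward_legacy_mode: Callable[[torch.Tensor, dict], Optional[torch.Tensor]],
) -> Optional[torch.Tensor]:
    """Sketch of the decision in LoraLayer.forward: take the CUDA Graph path
    only when it is enabled and the per-batch CUDA Graph params were prepared;
    otherwise fall back to the legacy path."""
    if lora_params is None:
        return None  # no LoRA adapters active for this batch
    use_cuda_graph = (
        lora_params.get("use_cuda_graph_mode", False)
        and lora_params.get("cuda_graph_params") is not None
    )
    if use_cuda_graph:
        return forward_cuda_graph_mode(x, lora_params)
    return forward_legacy_mode(x, lora_params)
```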
Estimated code review effort
🎯 5 (Critical) | ⏱️ ~120 minutes
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
| Description Check | ⚠️ Warning | The pull request description is significantly incomplete. While the author has provided valuable context including NVBug references and performance benchmarks showing the CUDA Graph multi-LoRA improvement, the core required sections from the template are missing. The "Description" section that should explain the issue and solution is empty, containing only placeholder text. The "Test Coverage" section that should list relevant tests is also empty. Additionally, the author explicitly notes "Still need to remove some code for logging / testing," indicating the PR requires further cleanup before merging. The PR checklist remains unchecked and unvalidated. |
✅ Passed checks (1 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The pull request title "[https://nvbugs/5322131][feat] Multi-LoRA serving with CUDA Graph" accurately describes the primary intent of the changeset. The modifications comprehensively implement multi-LoRA serving support with CUDA Graph compatibility, including new GPU kernels for grouped GEMM operations, slot management infrastructure (AdapterSlotManager, CudaGraphLoraParams, CudaGraphLoraManager), Python bindings, TorchScript operations, and integration into the execution engine and resource managers. The title is clear, concise, and specific—it conveys that the feature enables serving multiple LoRA adapters using CUDA Graph, which directly matches the scope of changes across all modified files. |
/bot run
PR_Github #21382 [ run ] triggered by Bot
PR_Github #21382 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #16147 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #21388 [ run ] triggered by Bot
PR_Github #21388 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #16153 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #21435 [ run ] triggered by Bot
PR_Github #21435 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #16187 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #21728 [ run ] triggered by Bot. Commit: c194938
PR_Github #21728 [ run ] completed with state SUCCESS. Commit: c194938
/LLM/main/L0_MergeRequest_PR pipeline #16375 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #21959 [ run ] triggered by Bot. Commit: d722047
PR_Github #21959 [ run ] completed with state SUCCESS. Commit: d722047
/LLM/main/L0_MergeRequest_PR pipeline #16556 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #22334 [ run ] triggered by Bot. Commit: 3f7441d
PR_Github #22334 [ run ] completed with state SUCCESS. Commit: 3f7441d
/LLM/main/L0_MergeRequest_PR pipeline #16838 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #23238 [ run ] triggered by Bot. Commit: 555ec91
PR_Github #23238 [ run ] completed with state FAILURE. Commit: 555ec91
/LLM/main/L0_MergeRequest_PR pipeline #17516 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #23503 [ run ] triggered by Bot. Commit: 65994dd
PR_Github #23503 [ run ] completed with state SUCCESS. Commit: 65994dd
/LLM/main/L0_MergeRequest_PR pipeline #17690 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #23626 [ run ] triggered by Bot. Commit: b1c10b3
PR_Github #23626 [ run ] completed with state SUCCESS. Commit: b1c10b3
/LLM/main/L0_MergeRequest_PR pipeline #17776 completed with status: 'FAILURE'
Does the change also affect the runtime without CUDA Graphs (i.e., "no CUDA Graph + multi LoRA" after the changes)?
It should not, as we keep the same legacy LoRA code in LoraLayer._forward_legacy_mode. That path is used when CUDA Graph is disabled or for a mixed prefill-decode batch.
@Funatiq