[FMDL-1328][feat] Add support for nano-v3 and super-v3 with pytorch backend
Features

- Verified with the VANILLA and CUTLASS MoE backends.
- Supports BF16 / FP8 / NVFP4 models.
- Supports multi-stream execution for MoE shared experts and MoE chunking.
Summary by CodeRabbit

- New Features
  - Added Mixture-of-Experts support with flexible activation-type configuration
  - Introduced support for the Nemotron-Nano model variant
- Improvements
  - Enhanced weight quantization for MoE operations
  - Optimized parallel MoE execution with improved stream management
- Tests
  - Expanded the test suite to cover additional Nemotron model variants
Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

- [ ] PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
- [ ] PR follows the TRT-LLM CODING GUIDELINES to the best of your knowledge.
- [ ] Test cases are provided for new code paths (see test instructions).
- [ ] Any new dependencies have been scanned for licenses and vulnerabilities.
- [ ] CODEOWNERS updated if ownership changes.
- [ ] Documentation updated as needed.
- [ ] Tava architecture diagram updated if there is a significant design change in the PR.
- [ ] The reviewers assigned automatically/manually are appropriate for the PR.
- [x] Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user-friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with the pull request.
skip
skip --comment COMMENT
Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Walkthrough
The changes add activation-type parameterization throughout the MoE quantization and weight-handling pipeline, introduce a new NemotronHMOE module with auxiliary CUDA stream support, extend weight mappers for MoE expert handling, and add utility functions for gated activation detection, weight splitting, and relu-squared computation. Changes span C++ kernels, Python model definitions, quantization logic, and test parameterization.
Changes
| Cohort / File(s) | Summary |
|---|---|
| C++ MoE Quantization<br/>`cpp/tensorrt_llm/thop/moeOp.cpp` | Added a `base_activation_type` parameter to `FusedMoeRunner::getQuantParams()`. Introduces `expand_ratio`, derived from the activation type, to adjust weight validation sizes from a fixed factor of 2 to dynamic factors in the MXFP4/MXF8 and NVFP4 branches. |
| HF Checkpoint Weight Mappers<br/>`tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py`, `tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py` | Updated import paths for `split` from local to centralized utils. Added MoE expert weight remapping logic in `nemotron_h_weight_mapper` to handle the VANILLA backend (direct copy) and non-VANILLA backends (up_proj → w1/w3, down_proj → w2, with scale handling for FP8/NVFP4). |
| Nemotron-H Model Definition<br/>`tensorrt_llm/_torch/models/modeling_nemotron_h.py` | Introduced the `NemotronHMOE` class implementing gated MoE with latent projection layers and auxiliary stream-based parallel execution. Extended `NemotronHLayer` to route layer type "E" to MoE and accept `aux_stream_dict`. Updated `NemotronHModel` to initialize auxiliary CUDA streams (MoeShared, MoeChunkingOverlap, MoeBalancer). Normalized `rms_norm_eps` in `NemotronHForCausalLM` from config. |
| MoE Module Factory & Interfaces<br/>`tensorrt_llm/_torch/modules/fused_moe/create_moe.py`, `tensorrt_llm/_torch/modules/fused_moe/interface.py` | Added an `activation_type` parameter to `create_moe()` and propagated it to the backend constructors. Introduced an internal `is_gated_activation` flag and `intermediate_size_expand_ratio` (2 for gated, 1 otherwise) in the MoE base class for use in weight shape calculations. |
| MoE Backend Implementations<br/>`tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py`, `tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py` | Added `activation_type` and `layer_idx` parameters to both implementations. `VanillaMoE` includes gating-aware expert creation (MLP with relu2 for the Relu2 activation, GatedMLP otherwise) and validation errors for unsupported non-gated configs. `CutlassFusedMoE` forwards the activation type to the base class and kernel. |
| MoE Quantization & Weight Handling<br/>`tensorrt_llm/_torch/modules/fused_moe/quantization.py` | Replaced hardcoded factor-2 multipliers with `intermediate_size_expand_ratio` in weight shape calculations (w3_w1_weight dimensions, w3_w1_weight_shape, scales). Updated split logic to use `split_length = intermediate_size_per_partition * expand_ratio // 2` for w3/w1 slicing across the FP8, NVFP4, and TRT variants. |
| Utility Functions<br/>`tensorrt_llm/_torch/utils.py` | Added `is_gated_activation(ActivationType) -> bool` to identify Swiglu/SwigluBias/Geglu activations. Added `split(x, tp_size, idx, dim=0) -> torch.Tensor` for tensor partitioning with divisibility validation. Added `relu2(x) -> torch.Tensor` computing relu-squared via `F.relu`. Added an import of `torch.nn.functional` as `F`. |
| Test Infrastructure<br/>`tests/unittest/_torch/modeling/test_modeling_nemotron_h.py` | Parameterized tests with `model_folder` to support Nemotron-H-8B-Base-8K and Nemotron-Nano-3-30B-A3.5B-dev-1024. Updated the `create_nemotron_h_llm()` signature to accept and route `model_folder` for model path construction. Replaced static GPU memory skips with per-model conditional skips. Added model-specific reference logprobs, tolerances, and expectations (exact checks for the smaller model, fuzzy comparison via `similar()` for the larger model). |
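The utility functions summarized above can be sketched roughly as follows. This is an illustrative mock in plain Python: the real helpers in `tensorrt_llm/_torch/utils.py` operate on `torch.Tensor` (with `relu2` built on `F.relu` and `split` slicing along a `dim` argument), and the `ActivationType` members shown are only the subset named in this walkthrough.

```python
from enum import Enum


class ActivationType(Enum):
    # Subset for illustration; the real enum lives in the TRT-LLM codebase.
    Swiglu = 1
    SwigluBias = 2
    Geglu = 3
    Relu2 = 4


def is_gated_activation(act: ActivationType) -> bool:
    # Gated activations fuse two projections (w1, w3), so downstream weight
    # shapes use an expand ratio of 2 instead of 1.
    return act in (ActivationType.Swiglu, ActivationType.SwigluBias,
                   ActivationType.Geglu)


def split(x, tp_size: int, idx: int):
    # Even partitioning across tp_size ranks with divisibility validation;
    # the real version slices a torch.Tensor along a given dim.
    assert len(x) % tp_size == 0, "size must be divisible by tp_size"
    chunk = len(x) // tp_size
    return x[idx * chunk:(idx + 1) * chunk]


def relu2(x: float) -> float:
    # relu-squared: max(x, 0) ** 2
    return max(x, 0.0) ** 2
```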
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant App as Application
    participant CreateMoE as create_moe()
    participant Backend as MoE Backend<br/>(Cutlass/Vanilla)
    participant Interface as MoE Interface
    participant Quantization as Quantization
    participant Kernel as Kernel/C++
    App->>CreateMoE: create_moe(..., activation_type)
    CreateMoE->>Backend: new Backend(..., activation_type)
    Backend->>Interface: super().__init__(..., activation_type)
    Interface->>Interface: is_gated = is_gated_activation(activation_type)
    Interface->>Interface: expand_ratio = 2 if is_gated else 1
    alt Vanilla MoE Path
        Backend->>Backend: if activation_type == Relu2<br/>create MLP experts
        Backend->>Backend: else create GatedMLP experts
    end
    alt Weight Loading Path
        Quantization->>Quantization: split_length = inter_size * expand_ratio // 2
        Quantization->>Quantization: allocate w3_w1 with expand_ratio scaling
        Quantization->>Kernel: pass expand_ratio to C++ quantParams
    end
    Backend->>Kernel: forward(..., activation_type)
    Kernel->>Kernel: getQuantParams(..., base_activation_type)<br/>adjust validation per activation type
    Kernel-->>Backend: result
    Backend-->>App: output
```
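The expand-ratio arithmetic in the diagram above can be illustrated with a small sketch. The names mirror the walkthrough (`intermediate_size_per_partition`, `expand_ratio`); the exact shapes and call sites in `quantization.py` may differ, so treat this as a sketch under those assumptions rather than the actual implementation.

```python
def w3_w1_rows(intermediate_size_per_partition: int, is_gated: bool) -> int:
    # Gated MoE fuses the w3 (gate) and w1 (up) projections row-wise, so the
    # fused weight is allocated with twice the per-partition dimension.
    expand_ratio = 2 if is_gated else 1
    return intermediate_size_per_partition * expand_ratio


def split_length(intermediate_size_per_partition: int, is_gated: bool) -> int:
    # Replaces the previously hardcoded factor of 2 in the w3/w1 slicing:
    # split_length = intermediate_size_per_partition * expand_ratio // 2.
    expand_ratio = 2 if is_gated else 1
    return intermediate_size_per_partition * expand_ratio // 2
```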
```mermaid
sequenceDiagram
    participant Model as NemotronHModel
    participant Layer as NemotronHLayer
    participant MoE as NemotronHMOE
    participant Router as Gate Router
    participant AuxStream as Aux CUDA Stream
    Model->>Model: __init__: create aux_stream_dict<br/>(MoeShared, Overlap, Balancer)
    Model->>Layer: pass aux_stream_dict
    Layer->>Layer: route layer_type=="E" to MoE
    Layer->>MoE: new NemotronHMOE(..., aux_stream_dict)
    MoE->>MoE: init latent projections (if enabled)
    MoE->>MoE: init gate and experts
    Layer->>MoE: forward(hidden_states)
    MoE->>Router: compute routing weights
    par Parallel Execution
        MoE->>MoE: shared path through gate
        MoE->>AuxStream: route to MoeShared stream
    and
        MoE->>MoE: expert path computation
        MoE->>AuxStream: route to MoeChunkingOverlap stream
    end
    AuxStream->>MoE: synchronize outputs
    MoE-->>Layer: combined result
    Layer-->>Model: propagate output
```
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~60 minutes
- C++ quantization logic (`moeOp.cpp`): New branching on `expand_ratio` requires verification against all quantization paths (MXFP4, NVFP8, NVFP4) to ensure size calculations remain correct and error messages align.
- New NemotronHMOE module (`modeling_nemotron_h.py`): Introduces untested parallel execution with auxiliary streams, latent projection logic, and new layer routing; requires careful verification of stream management and synchronization correctness.
- Activation-type propagation across multiple MoE backends: Dense, interconnected parameter threading through the factory, interface, vanilla, and Cutlass implementations; each backend's gating-aware expert initialization requires independent reasoning.
- Weight remapping complexity (`nemotron_h_weight_mapper.py`): Non-trivial MoE expert weight transformation logic (up_proj → w1/w3 splitting, scale handling per backend) with multiple error paths that need coverage testing.
- Quantization weight shape updates (`quantization.py`): Widespread replacement of factor-2 with `expand_ratio` across multiple quantization variants (FP8, NVFP4, TRT) needs verification that the slicing logic produces correct tensor dimensions for both gated and non-gated activations.
- Test parameterization (`test_modeling_nemotron_h.py`): Model-specific reference values and conditional tolerance logic; verify that each model's expected outputs and skip conditions are correctly mapped.
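For reviewers, the up_proj → w1/w3 remapping called out above can be pictured as follows. This is a hypothetical sketch using nested lists in place of checkpoint tensors; the actual mapper in `nemotron_h_weight_mapper.py` also handles per-backend scale tensors (FP8/NVFP4), and the half ordering shown here is an assumption, not confirmed by the diff.

```python
def remap_up_proj(up_proj_rows):
    """Split a fused up_proj weight into the w3/w1 halves expected by the
    non-VANILLA fused-MoE backends (sketch; half ordering is an assumption)."""
    assert len(up_proj_rows) % 2 == 0, "fused up_proj needs an even row count"
    half = len(up_proj_rows) // 2
    w3, w1 = up_proj_rows[:half], up_proj_rows[half:]
    return {"w1": w1, "w3": w3}


def remap_down_proj(down_proj_rows):
    # down_proj maps directly to w2; no splitting is involved.
    return {"w2": down_proj_rows}
```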
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description includes a Features section listing key capabilities, but the required Description and Test Coverage sections are missing (they contain only placeholder comments), and the PR checklist is incomplete. | Complete the Description section explaining what was changed and why, and the Test Coverage section listing the relevant tests. Ensure all PR checklist items are properly addressed. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.71%, which is insufficient; the required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |
✅ Passed checks (1 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly identifies the main feature: adding support for nano-v3 and super-v3 models with the PyTorch backend, matching the changes throughout the pull request. |
✨ Finishing touches

- [ ] 📝 Generate docstrings

🧪 Generate unit tests (beta)

- [ ] Create PR with unit tests
- [ ] Post copyable unit tests in a comment
/bot run
PR_Github #25179 [ run ] triggered by Bot. Commit: 94e7884
PR_Github #25179 [ run ] completed with state SUCCESS. Commit: 94e7884
/LLM/main/L0_MergeRequest_PR pipeline #19038 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #25280 [ run ] triggered by Bot. Commit: ce6a583
PR_Github #25280 [ run ] completed with state FAILURE. Commit: ce6a583
/LLM/main/L0_MergeRequest_PR pipeline #19125 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #25340 [ run ] triggered by Bot. Commit: ec1fc19
PR_Github #25340 [ run ] completed with state FAILURE. Commit: ec1fc19
/LLM/main/L0_MergeRequest_PR pipeline #19167 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #25520 [ run ] triggered by Bot. Commit: ec1fc19
PR_Github #25520 [ run ] completed with state SUCCESS. Commit: ec1fc19
/LLM/main/L0_MergeRequest_PR pipeline #19326 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #25627 [ run ] triggered by Bot. Commit: e72c071
PR_Github #25627 [ run ] completed with state SUCCESS. Commit: e72c071
/LLM/main/L0_MergeRequest_PR pipeline #19416 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #25681 [ run ] triggered by Bot. Commit: 855ea7b
PR_Github #25681 [ run ] completed with state SUCCESS. Commit: 855ea7b
/LLM/main/L0_MergeRequest_PR pipeline #19464 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #25745 [ run ] triggered by Bot. Commit: 5d91bd3
PR_Github #25745 [ run ] completed with state SUCCESS. Commit: 5d91bd3
/LLM/main/L0_MergeRequest_PR pipeline #19522 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #25793 [ run ] triggered by Bot. Commit: 5d91bd3
PR_Github #25793 [ run ] completed with state FAILURE. Commit: 5d91bd3
/LLM/main/L0_MergeRequest_PR pipeline #19565 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #25923 [ run ] triggered by Bot. Commit: 5d91bd3