
[FMDL-1328][feat] Add support for nano-v3 and super-v3 with pytorch backend

Open Wanli-Jiang opened this issue 4 months ago • 20 comments

Features

  • Verified with VANILLA and CUTLASS MoE backend.

  • Support BF16 / FP8 / NVFP4 models.

  • Support multi-stream for MoE shared and MoE chunking.

Summary by CodeRabbit

  • New Features

    • Added Mixture-of-Experts support with flexible activation type configuration
    • Introduced support for Nemotron-Nano model variant
  • Improvements

    • Enhanced weight quantization for MoE operations
    • Optimized parallel MoE execution with improved stream management
  • Tests

    • Expanded test suite to cover additional Nemotron model variants

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • [x] Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Wanli-Jiang avatar Nov 18 '25 09:11 Wanli-Jiang

๐Ÿ“ Walkthrough


The changes add activation-type parameterization throughout the MoE quantization and weight-handling pipeline, introduce a new NemotronHMOE module with auxiliary CUDA stream support, extend weight mappers for MoE expert handling, and add utility functions for gated activation detection, weight splitting, and relu-squared computation. Changes span C++ kernels, Python model definitions, quantization logic, and test parameterization.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **C++ MoE Quantization**<br/>`cpp/tensorrt_llm/thop/moeOp.cpp` | Added a `base_activation_type` parameter to `FusedMoeRunner::getQuantParams()`. Introduces an `expand_ratio` derived from the activation type to adjust weight validation sizes from a fixed factor of 2 to dynamic factors in the MXFP4/MXFP8 and NVFP4 branches. |
| **HF Checkpoint Weight Mappers**<br/>`tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py`, `tensorrt_llm/_torch/models/checkpoints/hf/qwen3_next_weight_mapper.py` | Updated import paths for `split` from local to centralized utils. Added MoE expert weight remapping logic in `nemotron_h_weight_mapper` to handle the VANILLA backend (direct copy) and non-VANILLA backends (up_proj → w1/w3, down_proj → w2, with scale handling for FP8/NVFP4). |
| **Nemotron-H Model Definition**<br/>`tensorrt_llm/_torch/models/modeling_nemotron_h.py` | Introduced a `NemotronHMOE` class implementing gated MoE with latent projection layers and auxiliary stream-based parallel execution. Extended `NemotronHLayer` to route layer type "E" to MoE and accept `aux_stream_dict`. Updated `NemotronHModel` to initialize auxiliary CUDA streams (MoeShared, MoeChunkingOverlap, MoeBalancer). Normalized `rms_norm_eps` in `NemotronHForCausalLM` from config. |
| **MoE Module Factory & Interfaces**<br/>`tensorrt_llm/_torch/modules/fused_moe/create_moe.py`, `tensorrt_llm/_torch/modules/fused_moe/interface.py` | Added an `activation_type` parameter to `create_moe()` and propagated it to backend constructors. Introduced an internal `is_gated_activation` flag and `intermediate_size_expand_ratio` (2 for gated, 1 otherwise) in the MoE base class for use in weight shape calculations. |
| **MoE Backend Implementations**<br/>`tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py`, `tensorrt_llm/_torch/modules/fused_moe/fused_moe_vanilla.py` | Added `activation_type` and `layer_idx` parameters to both implementations. `VanillaMoE` includes gating-aware expert creation (MLP with relu2 for Relu2 activation, GatedMLP otherwise) and validation errors for unsupported non-gated configs. `CutlassFusedMoE` forwards the activation type to the base class and kernel. |
| **MoE Quantization & Weight Handling**<br/>`tensorrt_llm/_torch/modules/fused_moe/quantization.py` | Replaced hardcoded factor-2 multipliers with `intermediate_size_expand_ratio` in weight shape calculations (w3_w1_weight dimensions, w3_w1_weight_shape, scales). Updated split logic to use `split_length = intermediate_size_per_partition * expand_ratio // 2` for w3/w1 slicing across FP8, NVFP4, and TRT variants. |
| **Utility Functions**<br/>`tensorrt_llm/_torch/utils.py` | Added `is_gated_activation(ActivationType) → bool` to identify Swiglu/SwigluBias/Geglu activations. Added `split(x, tp_size, idx, dim=0) → torch.Tensor` for tensor partitioning with divisibility validation. Added `relu2(x) → torch.Tensor` computing relu-squared via `F.relu`. Added an import of `torch.nn.functional` as `F`. |
| **Test Infrastructure**<br/>`tests/unittest/_torch/modeling/test_modeling_nemotron_h.py` | Parameterized tests with `model_folder` to support Nemotron-H-8B-Base-8K and Nemotron-Nano-3-30B-A3.5B-dev-1024. Updated the `create_nemotron_h_llm()` signature to accept and route `model_folder` for model path construction. Replaced static GPU memory skips with per-model conditional skips. Added model-specific reference logprobs, tolerances, and expectations (exact checks for the smaller model, fuzzy comparison via `similar()` for the larger model). |
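The utility additions described above (gated-activation detection, tensor splitting, and relu-squared) can be illustrated with a pure-Python sketch. The `ActivationType` enum values and the list-based `split` here are stand-ins for the real TRT-LLM enum and torch implementations, not the actual code:

```python
from enum import Enum


class ActivationType(Enum):
    # Hypothetical stand-in for TRT-LLM's ActivationType enum.
    Swiglu = 0
    SwigluBias = 1
    Geglu = 2
    Relu2 = 3


def is_gated_activation(act: ActivationType) -> bool:
    # Gated activations carry two projections (gate and up, i.e. w3 and w1).
    return act in (ActivationType.Swiglu, ActivationType.SwigluBias,
                   ActivationType.Geglu)


def expand_ratio(act: ActivationType) -> int:
    # Gated MLPs fuse w1 and w3, doubling the first weight dimension;
    # non-gated activations such as Relu2 keep a single projection.
    return 2 if is_gated_activation(act) else 1


def split(x: list, tp_size: int, idx: int) -> list:
    # List-based analogue of the torch split: partition x into tp_size
    # equal chunks and return chunk idx, validating divisibility first.
    assert len(x) % tp_size == 0, "length must be divisible by tp_size"
    chunk = len(x) // tp_size
    return x[idx * chunk:(idx + 1) * chunk]


def relu2(x: list) -> list:
    # relu-squared: max(v, 0) ** 2, elementwise.
    return [max(v, 0.0) ** 2 for v in x]
```

For example, `split(list(range(8)), tp_size=4, idx=1)` selects the second of four equal partitions, mirroring how expert weights are sharded across tensor-parallel ranks.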

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant CreateMoE as create_moe()
    participant Backend as MoE Backend<br/>(Cutlass/Vanilla)
    participant Interface as MoE Interface
    participant Quantization as Quantization
    participant Kernel as Kernel/C++

    App->>CreateMoE: create_moe(..., activation_type)
    CreateMoE->>Backend: new Backend(..., activation_type)
    Backend->>Interface: super().__init__(..., activation_type)
    Interface->>Interface: is_gated = is_gated_activation(activation_type)
    Interface->>Interface: expand_ratio = 2 if is_gated else 1
    
    alt Vanilla MoE Path
        Backend->>Backend: if activation_type == Relu2<br/>create MLP experts
        Backend->>Backend: else create GatedMLP experts
    end
    
    alt Weight Loading Path
        Quantization->>Quantization: split_length = inter_size * expand_ratio // 2
        Quantization->>Quantization: allocate w3_w1 with expand_ratio scaling
        Quantization->>Kernel: pass expand_ratio to C++ quantParams
    end
    
    Backend->>Kernel: forward(..., activation_type)
    Kernel->>Kernel: getQuantParams(..., base_activation_type)<br/>adjust validation per activation type
    Kernel-->>Backend: result
    Backend-->>App: output

sequenceDiagram
    participant Model as NemotronHModel
    participant Layer as NemotronHLayer
    participant MoE as NemotronHMOE
    participant Router as Gate Router
    participant AuxStream as Aux CUDA Stream

    Model->>Model: __init__: create aux_stream_dict<br/>(MoeShared, Overlap, Balancer)
    Model->>Layer: pass aux_stream_dict
    
    Layer->>Layer: route layer_type=="E" to MoE
    Layer->>MoE: new NemotronHMOE(..., aux_stream_dict)
    MoE->>MoE: init latent projections (if enabled)
    MoE->>MoE: init gate and experts
    
    Layer->>MoE: forward(hidden_states)
    MoE->>Router: compute routing weights
    
    par Parallel Execution
        MoE->>MoE: shared path through gate
        MoE->>AuxStream: route to MoeShared stream
    and
        MoE->>MoE: expert path computation
        MoE->>AuxStream: route to MoeChunkingOverlap stream
    end
    
    AuxStream->>MoE: synchronize outputs
    MoE-->>Layer: combined result
    Layer-->>Model: propagate output

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • C++ quantization logic (moeOp.cpp): New branching on expand_ratio requires verification against all quantization paths (MXFP4, NVFP8, NVFP4) to ensure size calculations remain correct and error messages align.
  • New NemotronHMOE module (modeling_nemotron_h.py): Introduces untested parallel execution with auxiliary streams, latent projection logic, and new layer routing; requires careful verification of stream management and synchronization correctness.
  • Activation-type propagation across multiple MoE backends: Dense, interconnected parameter threading through factory, interface, vanilla, and Cutlass implementations; each backend's gating-aware expert initialization requires independent reasoning.
  • Weight remapping complexity (nemotron_h_weight_mapper.py): Non-trivial MoE expert weight transformation logic (up_proj โ†’ w1/w3 splitting, scale handling per backend) with multiple error paths that need coverage testing.
  • Quantization weight shape updates (quantization.py): Widespread replacement of factor-2 with expand_ratio across multiple quantization variants (FP8, NVFP4, TRT) needs verification that slicing logic produces correct tensor dimensions for both gated and non-gated activations.
  • Test parameterization (test_modeling_nemotron_h.py): Model-specific reference values and conditional tolerance logic; verify that each model's expected outputs and skip conditions are correctly mapped.
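The shape bookkeeping flagged in the quantization and weight-remapping bullets can be sketched as follows. Function names and the w3-first ordering of the fused up_proj halves are assumptions for illustration, not the exact TRT-LLM code:

```python
def w3_w1_weight_shape(num_experts: int, intermediate_size_per_partition: int,
                       hidden_size: int, expand_ratio: int) -> tuple:
    # Gated activations (expand_ratio == 2) stack w3 and w1 into one fused
    # tensor; non-gated activations (expand_ratio == 1) keep only w1's rows.
    return (num_experts, intermediate_size_per_partition * expand_ratio,
            hidden_size)


def w3_w1_split_length(intermediate_size_per_partition: int,
                       expand_ratio: int) -> int:
    # Row count of each half when slicing the fused tensor back into w3 and w1,
    # matching split_length = inter_size * expand_ratio // 2 from the summary.
    return intermediate_size_per_partition * expand_ratio // 2


def remap_up_proj(up_proj_rows: list, intermediate_size: int) -> tuple:
    # Hypothetical remap for non-VANILLA backends: split a fused up_proj
    # into two halves (the w3-first ordering here is an assumption).
    w3 = up_proj_rows[:intermediate_size]
    w1 = up_proj_rows[intermediate_size:]
    return w1, w3
```

With a gated activation and `intermediate_size_per_partition = 64`, each half is 64 rows; with a non-gated activation the fused dimension collapses to a single 64-row projection, which is the factor the C++ validation in `moeOp.cpp` now has to account for.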

Pre-merge checks and finishing touches

โŒ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description check โš ๏ธ Warning The PR description includes a Features section listing key capabilities, but the required Description and Test Coverage sections are missing (only contain placeholder comments), and the PR checklist is incomplete. Complete the Description section explaining what was changed and why, and the Test Coverage section listing relevant tests. Ensure all PR checklist items are properly addressed.
Docstring Coverage โš ๏ธ Warning Docstring coverage is 5.71% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
โœ… Passed checks (1 passed)
Check name Status Explanation
Title check โœ… Passed The title clearly identifies the main feature: adding support for nano-v3 and super-v3 models with the PyTorch backend, matching the changes throughout the pull request.
✨ Finishing touches
  • [ ] 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • [ ] Create PR with unit tests
  • [ ] Post copyable unit tests in a comment

[!TIP]

๐Ÿ“ Customizable high-level summaries are now available in beta!

You can now customize how CodeRabbit generates the high-level summary in your pull requests, including its content, structure, tone, and formatting.

  • Provide your own instructions using the high_level_summary_instructions setting.
  • Format the summary however you like (bullet lists, tables, multi-section layouts, contributor stats, etc.).
  • Use high_level_summary_in_walkthrough to move the summary from the description to the walkthrough section.

Example instruction:

"Divide the high-level summary into five sections:

  1. ๐Ÿ“ Description โ€” Summarize the main change in 50โ€“60 words, explaining what was done.
  2. ๐Ÿ““ References โ€” List relevant issues, discussions, documentation, or related PRs.
  3. ๐Ÿ“ฆ Dependencies & Requirements โ€” Mention any new/updated dependencies, environment variable changes, or configuration updates.
  4. ๐Ÿ“Š Contributor Summary โ€” Include a Markdown table showing contributions: | Contributor | Lines Added | Lines Removed | Files Changed |
  5. โœ”๏ธ Additional Notes โ€” Add any extra reviewer context. Keep each section concise (under 200 words) and use bullet or numbered lists for clarity."

Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

โค๏ธ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot] avatar Nov 19 '25 07:11 coderabbitai[bot]

/bot run

Wanli-Jiang avatar Nov 20 '25 07:11 Wanli-Jiang

/bot run

ZhanruiSunCh avatar Nov 20 '25 08:11 ZhanruiSunCh

PR_Github #25179 [ run ] triggered by Bot. Commit: 94e7884

tensorrt-cicd avatar Nov 20 '25 08:11 tensorrt-cicd

PR_Github #25179 [ run ] completed with state SUCCESS. Commit: 94e7884 /LLM/main/L0_MergeRequest_PR pipeline #19038 completed with status: 'FAILURE'

tensorrt-cicd avatar Nov 20 '25 11:11 tensorrt-cicd

/bot run --disable-fail-fast

Wanli-Jiang avatar Nov 21 '25 01:11 Wanli-Jiang

/bot run --disable-fail-fast

Wanli-Jiang avatar Nov 21 '25 02:11 Wanli-Jiang

PR_Github #25280 [ run ] triggered by Bot. Commit: ce6a583

tensorrt-cicd avatar Nov 21 '25 02:11 tensorrt-cicd

PR_Github #25280 [ run ] completed with state FAILURE. Commit: ce6a583 /LLM/main/L0_MergeRequest_PR pipeline #19125 completed with status: 'FAILURE'

tensorrt-cicd avatar Nov 21 '25 03:11 tensorrt-cicd

/bot run --disable-fail-fast

Wanli-Jiang avatar Nov 21 '25 08:11 Wanli-Jiang

PR_Github #25340 [ run ] triggered by Bot. Commit: ec1fc19

tensorrt-cicd avatar Nov 21 '25 08:11 tensorrt-cicd

PR_Github #25340 [ run ] completed with state FAILURE. Commit: ec1fc19 /LLM/main/L0_MergeRequest_PR pipeline #19167 completed with status: 'FAILURE'

tensorrt-cicd avatar Nov 21 '25 19:11 tensorrt-cicd

/bot run --disable-fail-fast

Wanli-Jiang avatar Nov 24 '25 07:11 Wanli-Jiang

PR_Github #25520 [ run ] triggered by Bot. Commit: ec1fc19

tensorrt-cicd avatar Nov 24 '25 07:11 tensorrt-cicd

PR_Github #25520 [ run ] completed with state SUCCESS. Commit: ec1fc19 /LLM/main/L0_MergeRequest_PR pipeline #19326 completed with status: 'FAILURE'

tensorrt-cicd avatar Nov 24 '25 12:11 tensorrt-cicd

/bot run --disable-fail-fast

Wanli-Jiang avatar Nov 25 '25 01:11 Wanli-Jiang

PR_Github #25627 [ run ] triggered by Bot. Commit: e72c071

tensorrt-cicd avatar Nov 25 '25 01:11 tensorrt-cicd

PR_Github #25627 [ run ] completed with state SUCCESS. Commit: e72c071 /LLM/main/L0_MergeRequest_PR pipeline #19416 completed with status: 'FAILURE'

tensorrt-cicd avatar Nov 25 '25 06:11 tensorrt-cicd

/bot run --disable-fail-fast

Wanli-Jiang avatar Nov 25 '25 07:11 Wanli-Jiang

PR_Github #25681 [ run ] triggered by Bot. Commit: 855ea7b

tensorrt-cicd avatar Nov 25 '25 07:11 tensorrt-cicd

PR_Github #25681 [ run ] completed with state SUCCESS. Commit: 855ea7b /LLM/main/L0_MergeRequest_PR pipeline #19464 completed with status: 'FAILURE'

tensorrt-cicd avatar Nov 25 '25 13:11 tensorrt-cicd

/bot run --disable-fail-fast

Wanli-Jiang avatar Nov 25 '25 13:11 Wanli-Jiang

PR_Github #25745 [ run ] triggered by Bot. Commit: 5d91bd3

tensorrt-cicd avatar Nov 25 '25 13:11 tensorrt-cicd

PR_Github #25745 [ run ] completed with state SUCCESS. Commit: 5d91bd3 /LLM/main/L0_MergeRequest_PR pipeline #19522 completed with status: 'FAILURE'

tensorrt-cicd avatar Nov 25 '25 18:11 tensorrt-cicd

/bot run --disable-fail-fast

Wanli-Jiang avatar Nov 26 '25 01:11 Wanli-Jiang

PR_Github #25793 [ run ] triggered by Bot. Commit: 5d91bd3

tensorrt-cicd avatar Nov 26 '25 01:11 tensorrt-cicd

PR_Github #25793 [ run ] completed with state FAILURE. Commit: 5d91bd3 /LLM/main/L0_MergeRequest_PR pipeline #19565 completed with status: 'FAILURE'

tensorrt-cicd avatar Nov 26 '25 07:11 tensorrt-cicd

/bot run --disable-fail-fast

Wanli-Jiang avatar Nov 26 '25 11:11 Wanli-Jiang

/bot run --disable-fail-fast

Wanli-Jiang avatar Nov 27 '25 01:11 Wanli-Jiang

PR_Github #25923 [ run ] triggered by Bot. Commit: 5d91bd3

tensorrt-cicd avatar Nov 27 '25 01:11 tensorrt-cicd