
feat(moe): Support placing MTP layers into standalone stages

Open · BestJuly opened this issue 1 month ago · 1 comment

What does this PR do ?

Support placing MTP into a standalone stage, e.g., MTP in the second-to-last VPP stage for better VPP balance. This is the PR for the main branch; the corresponding PR for the dev branch is PR1916.

Highlights

This document outlines the design of a new MTP (Multi-Token Prediction) standalone feature. This feature allows MTP layers to be placed into a standalone VPP (Virtual Pipeline Parallelism) stage, rather than being confined to the last VPP stage.
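To make the placement concrete, here is a minimal sketch (illustrative only, with made-up stage and helper names, not the Megatron-LM API) of a virtual-pipeline layout in which the MTP block forms its own VPP stage just before the stage that holds the output head and loss:

```python
# Illustrative sketch only: build a per-virtual-stage layout in which the MTP block
# occupies a standalone VPP stage (here the second-to-last), while the output head
# and loss computation stay in the last stage. Stage/layer names are hypothetical.

def build_vpp_layout(num_decoder_layers: int, num_vpp_stages: int):
    # Reserve one stage for MTP and one for the output head / loss; assumes the
    # remaining decoder layers divide evenly across the decoder stages.
    decoder_stages = num_vpp_stages - 2
    per_stage = num_decoder_layers // decoder_stages
    layout = [
        [f"decoder_{s * per_stage + i}" for i in range(per_stage)]
        for s in range(decoder_stages)
    ]
    layout[-1].append("final_layernorm")     # final layernorm ends the decoder stage
    layout.append(["mtp"])                   # standalone MTP stage (second-to-last)
    layout.append(["output_head", "loss"])   # LM loss and MTP loss share the last stage
    return layout


if __name__ == "__main__":
    for i, stage in enumerate(build_vpp_layout(num_decoder_layers=12, num_vpp_stages=6)):
        print(f"vpp stage {i}: {stage}")
```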

Key Benefits:

  • Enhanced Flexibility in MTP VPP Placement
    • MTP layers can now be positioned in either the last or second-to-last VPP stage, offering greater control over pipeline configuration.
  • Improved Load Balancing
    • The MTP standalone feature isolates transformer block-related computations, which have a similar computational cost to a normal transformer block. This split, with MTP loss calculation remaining in the second-to-last VPP stage, contributes to a more balanced workload.
  • Minimal Impact on Existing Code Paths
    • The design aims for near-zero impact on current code. A minor change involves the final layernorm, which is now placed at the end of the decoder stage. In scenarios where the last stage contains only MTP layers and loss, this layernorm resides in the second-to-last stage alongside a decoder layer. While this differs from previous implementations (where the final layernorm was always in the last stage), bitwise correctness has been validated under various conditions, including forced layernorm placement and disabled gradient clipping.

Considerations:

  • Changes in P2P Communication Shapes
    • The need to pass both the original and the MTP hidden states to subsequent VPP stages alters the P2P communication shapes. To limit the impact on core communication, the shapes are communicated upfront when MTP standalone settings are detected in the pipeline layout, as sketched below.
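A minimal sketch of the idea (assuming an already-initialized torch.distributed process group; the rank arguments and helper name are hypothetical and this is not the actual Megatron-LM p2p_communication code): before exchanging activations, neighboring stages exchange the tensor shapes once, so receive buffers for both the decoder hidden states and the MTP hidden states can be allocated with the right sizes.

```python
import torch
import torch.distributed as dist


def exchange_shapes(shapes, next_rank, prev_rank):
    """Send this stage's output shapes downstream and receive the upstream shapes.

    `shapes` is a list of 3-D shape tuples, e.g. [(s, b, h), (s, b, h)] when both
    the decoder hidden states and the MTP hidden states cross the stage boundary.
    Assumes the neighbor sends the same number of 3-D shapes back.
    """
    send_buf = torch.tensor([d for shape in shapes for d in shape],
                            dtype=torch.int64, device="cuda")
    recv_buf = torch.empty_like(send_buf)
    ops = [
        dist.P2POp(dist.isend, send_buf, next_rank),
        dist.P2POp(dist.irecv, recv_buf, prev_rank),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    flat = recv_buf.tolist()
    # Re-group the flat list of dimensions back into 3-D shape tuples.
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]
```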

Design idea

We split MTP computation into two parts

  • Output head computation.
  • Others.

We divide MTP computation into two parts: output head computation and everything else. The advantage is that MTP loss computation is treated as primary: for the untied-embedding case, the MTP loss can use exactly the same code path as the LM loss, since both are in the same VPP stage. This is simpler than putting the MTP loss calculation in the same VPP stage as in MR#2996, and it avoids the bugs related to that approach. With the new design, the DSv3 TFLOPs also remain compatible with the #2996 version included in the 0.13 EA branch.
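A conceptual sketch of the split (plain PyTorch with hypothetical module names; the real Megatron-LM implementation differs): the trunk part of MTP runs in the standalone VPP stage, while the output-head projection and the loss run in the last stage and reuse one code path for both the LM loss and the MTP loss.

```python
import torch
import torch.nn as nn


class MTPTrunk(nn.Module):
    """Runs in the standalone MTP VPP stage: everything except the output head.

    A standard encoder layer stands in for the MTP transformer block here;
    hidden_size is assumed to be divisible by the number of attention heads.
    """

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(2 * hidden_size, hidden_size)  # combine embeddings + hidden states
        self.block = nn.TransformerEncoderLayer(hidden_size, nhead=8, batch_first=True)

    def forward(self, hidden_states, shifted_embeddings):
        mtp_hidden = self.proj(torch.cat([hidden_states, shifted_embeddings], dim=-1))
        return self.block(mtp_hidden)  # handed to the next stage via P2P


class SharedHeadAndLoss(nn.Module):
    """Runs in the last VPP stage: one code path for both the LM loss and the MTP loss."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_states, labels):
        logits = self.head(hidden_states)
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), labels.reshape(-1)
        )
```

The point of the split is that the head-and-loss module does not need to know whether its input came from the decoder or from the MTP trunk, which is what lets the two losses share one code path.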

Perf and results

Perf

  • The DSv3 throughput is on par with, or even higher than, 0.13 EA (which used the previous design for this standalone feature).

Convergence

On full DSv3 (real training mode), the convergence-related metrics almost overlap with those from the main-branch settings.

Deterministic correctness check: we validate correctness in two main parts (across different PP layouts, runs can be bitwise aligned when grad clipping is manually disabled; see the comparison sketch after this list):

  • Deterministic alignment between main branch and this MR.
    • The same pipeline layout settings;
    • Different pipeline layout settings.
  • Deterministic alignment within this MR to validate the compatibility of 1F1B overlap.
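For context, such a bitwise check can be as simple as exact float equality on per-step losses from two runs; the JSONL log format and the `lm_loss` key below are assumptions for illustration, not the Megatron-LM log format.

```python
import json


def losses_bitwise_equal(log_a: str, log_b: str) -> bool:
    """Compare per-step losses from two run logs with exact equality."""
    with open(log_a) as fa, open(log_b) as fb:
        a = [json.loads(line)["lm_loss"] for line in fa]
        b = [json.loads(line)["lm_loss"] for line in fb]
    # Exact float equality, not math.isclose: any divergence breaks determinism.
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))
```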

Pre-checks

  • [ ] I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • [ ] I have added relevant unit tests
  • [ ] I have added relevant functional tests
  • [ ] I have added proper typing to my code (see Typing guidelines)
  • [ ] I have added relevant documentation
  • [ ] I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

:warning: Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by a member of either the core-adlr or the core-nemo team.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

BestJuly · Nov 05 '25 02:11

@deepakn94 can you please take a look at this MR?

kvareddy · Nov 05 '25 15:11