feat: Support mcore virtual pipeline parallel
What does this PR do?
Adds support for virtual pipeline parallelism (VPP) in mcore.
Issues
closes #1038
Usage
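A minimal sketch of the intended usage, shown as a Python config fragment; the `megatron_cfg` nesting and the `pipeline_model_parallel_size` key are assumptions, while `virtual_pipeline_model_parallel_size` is the setting this PR wires through:

```python
# Hypothetical policy-config fragment; key names follow Megatron conventions
# and may not match the exact schema in this repo.
policy_config = {
    "megatron_cfg": {
        "pipeline_model_parallel_size": 4,
        # With the interleaved (virtual pipeline) schedule, each pipeline
        # rank holds virtual_pipeline_model_parallel_size model chunks.
        "virtual_pipeline_model_parallel_size": 2,
    },
}
```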
Before your PR is "Ready for review"
Pre checks:
- [ ] Make sure you read and followed Contributor guidelines
- [ ] Did you write any new necessary tests?
- [ ] Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
- [ ] Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.
Additional Information
- ...
Summary by CodeRabbit
- New Features
- Added multi-model support for training, inference, checkpoints, and exports.
- Sequence packing now adapts to data/pipeline parallelism with configurable bin counts for improved throughput.
- Improved handling for virtual pipeline stages with per-stage data iterators.
- Refactor
- Internal model handling updated to operate on multiple sub-models consistently across workflows.
- Chores
- Updated Megatron-Bridge workspace submodule (no user-visible changes).
📝 Walkthrough
Updates include a submodule pointer bump, the introduction of two optional sequence packing parameters, pipeline-parallel-aware binning in LM policy training, a broad refactor of MegatronPolicyWorker to handle a list of models (multi-model/pipeline/virtual-pipeline aware), and a corresponding utility change to read modules from the first model in the list.
Changes
| Cohort / File(s) | Summary |
|---|---|
| **Submodule pointer update**<br>`3rdparty/Megatron-Bridge-workspace/Megatron-Bridge` | Submodule reference updated from `abd52c89...` to `30f24c66...`. No code or API changes in this repo. |
| **Sequence packing args extension**<br>`nemo_rl/distributed/batched_data_dict.py` | Added optional fields to `SequencePackingArgs`: `min_bin_count`, `bin_count_multiple`. `shard_by_batch_size` passes these to the bin packer, defaulting to the shard count when falsy (see the sketch after this table). |
| **PP-aware sequence packing config**<br>`nemo_rl/models/policy/lm_policy.py` | Computes `pp_size`; when sequence packing is used, sets `min_bin_count` and `bin_count_multiple` to `dp_size * pp_size`. No changes to the dynamic batching path. |
| **Multi-model worker refactor and pipeline data flow**<br>`nemo_rl/models/policy/megatron_policy_worker.py` | Treats `self.model` as a list; updates hooks, training, device moves, checkpointing, exports, parameter updates, and caches to iterate per sub-model. Switches to a list of data iterators when `virtual_pipeline_model_parallel_size > 1`. Configures overlap of P2P communication for multi-stage pipelines. Adds rank-aware log prints. No public API signature changes. |
| **Utils aligned to model list**<br>`nemo_rl/models/policy/utils.py` | `get_gpu_info` now inspects modules from `model[0].named_modules()` instead of `model.named_modules()`. Return structure unchanged. |
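A rough sketch of the extended packing args and the defaulting behavior described in the table; the field names `min_bin_count` and `bin_count_multiple` come from the summary, while the surrounding class and helper are simplified stand-ins for the real `batched_data_dict.py` code:

```python
from typing import Optional, TypedDict


class SequencePackingArgs(TypedDict, total=False):
    """Sketch of the extended packing args; existing fields are elided."""

    min_bin_count: Optional[int]
    bin_count_multiple: Optional[int]


def resolve_bin_counts(args: SequencePackingArgs, shard_count: int) -> tuple[int, int]:
    # shard_by_batch_size-style defaulting: fall back to the shard count
    # whenever the new fields are unset or otherwise falsy.
    min_bin_count = args.get("min_bin_count") or shard_count
    bin_count_multiple = args.get("bin_count_multiple") or shard_count
    return min_bin_count, bin_count_multiple
```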
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Trainer
    participant LMPolicy
    participant ParallelConfig as Parallel Config
    participant SeqPack as Sequence Packing Args
    participant BinPacker
    Trainer->>LMPolicy: start_training()
    LMPolicy->>ParallelConfig: get dp_size, pp_size
    ParallelConfig-->>LMPolicy: dp_size, pp_size
    LMPolicy->>SeqPack: set min_bin_count = dp_size*pp_size\nset bin_count_multiple = dp_size*pp_size
    LMPolicy->>BinPacker: construct(..., min_bin_count, bin_count_multiple)
    BinPacker-->>LMPolicy: ready
    LMPolicy-->>Trainer: proceed with training
```
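The flow above, condensed into a short Python sketch; `configure_sequence_packing` and the `sequence_packing`/`enabled` keys are hypothetical placeholders, but the `dp_size * pp_size` assignments mirror the `lm_policy.py` summary:

```python
def configure_sequence_packing(policy_cfg: dict, dp_size: int, pp_size: int) -> None:
    # If sequence packing is enabled, require at least dp_size * pp_size bins,
    # in multiples of the same value, so every (DP, PP) shard gets whole bins.
    seq_pack = policy_cfg.get("sequence_packing")
    if seq_pack and seq_pack.get("enabled"):
        seq_pack["min_bin_count"] = dp_size * pp_size
        seq_pack["bin_count_multiple"] = dp_size * pp_size
```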
```mermaid
sequenceDiagram
    autonumber
    actor Orchestrator
    participant Worker as MegatronPolicyWorker
    participant Models as [Model_0, Model_1, ...]
    participant DataIt as Data Iterator(s)
    participant Checkpoint as Checkpoint/Export
    Orchestrator->>Worker: initialize(models=list)
    Worker->>Models: for each: move to device, set hooks
    alt virtual_pipeline_model_parallel_size > 1
        Worker->>DataIt: create list of iterators (per VP stage)
    else
        Worker->>DataIt: create single iterator
    end
    loop training steps
        Worker->>Models: for each: zero grads, reset caches
        Worker->>Models: forward/backward (passes list to step fn)
    end
    Worker->>Checkpoint: save(model=list), export weights
    Worker-->>Orchestrator: done
```
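A condensed sketch of the per-chunk handling shown above, assuming a Megatron-Core-style `get_forward_backward_func` schedule; `make_batch_iterator`, the config access, and the method signature are placeholders, while passing `self.model` as a list and building per-stage iterators follow the walkthrough:

```python
from megatron.core.pipeline_parallel import get_forward_backward_func


def train_step(self, make_batch_iterator, forward_step_fn, num_microbatches,
               seq_length, micro_batch_size):
    # self.model is now always a list of model chunks; with VPP > 1 there is
    # one chunk per virtual pipeline stage on this rank.
    for model_chunk in self.model:
        model_chunk.zero_grad_buffer()

    vpp = self.cfg["megatron_cfg"].get("virtual_pipeline_model_parallel_size")
    if vpp is not None and vpp > 1:
        # The interleaved schedule expects one data iterator per model chunk.
        data_iterator = [make_batch_iterator() for _ in self.model]
    else:
        data_iterator = make_batch_iterator()

    forward_backward_func = get_forward_backward_func()
    return forward_backward_func(
        forward_step_func=forward_step_fn,
        data_iterator=data_iterator,
        model=self.model,  # the full list is passed through to the schedule
        num_microbatches=num_microbatches,
        seq_length=seq_length,
        micro_batch_size=micro_batch_size,
    )
```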
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~60 minutes
Poem
I thump with glee on parallel trails,
Bins now hop in DP×PP scales.
A warren of models, lined in a row,
Each takes a nibble, then onward we go.
Pack, iterate, checkpoint—done!
Carrots aligned, workloads run.
Happy hops under the training sun.
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 64.29% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'perf: Support mcore virtual pipeline parallel' is partially related to the changeset. While virtual pipeline parallel support is a significant aspect of the changes (particularly in megatron_policy_worker.py), the title uses 'perf:' prefix suggesting performance optimization, but the changes are primarily feature additions (multi-model support, new optional fields, pipeline parallel awareness) rather than performance improvements. The title captures one aspect but undersells the broader scope of the changes. |
✨ Finishing touches
- [ ] 📝 Generate docstrings
🧪 Generate unit tests (beta)
- [ ] Create PR with unit tests
- [ ] Post copyable unit tests in a comment
@yaoyu-33 @yfw can we get a review for this?
@parthmannan do you have a different PR that supports asymmetric VPP? If so, maybe we should close this and work on your PR, as that covers more cases.
It would be a small change on top of this. I haven't opened it yet because I got pulled into something urgent. I'll try to get it opened, with your base VPP changes included, by tomorrow.
@parthmannan let's merge all changes related to VPP in one PR; I can drive this if you just merge your changes into my branch (guyueh1/support_mcore_vpp).