
feat: Support mcore virtual pipeline parallel

Open · guyueh1 opened this pull request 3 months ago • 5 comments

What does this PR do?

Support virtual pipeline parallelism (VPP) in mcore.

Issues

closes #1038

Usage

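A minimal usage sketch, assuming Megatron-Core-style key names inside the policy's megatron_cfg; these keys are inferred from the files this PR touches and are assumptions, not confirmed NeMo-RL config API:

# Hypothetical sketch only: key names mirror Megatron-Core terminology and
# this PR's touched files; they are assumptions, not confirmed config API.
policy_config = {
    "megatron_cfg": {
        "tensor_model_parallel_size": 1,
        "pipeline_model_parallel_size": 4,  # layers split across 4 PP ranks
        # VPP: each PP rank holds multiple non-contiguous model chunks, so the
        # worker's self.model becomes a list of sub-models (see the walkthrough).
        "virtual_pipeline_model_parallel_size": 2,  # 2 chunks per PP rank
    }
}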

Before your PR is "Ready for review"

Pre checks:

  • [ ] Make sure you read and followed Contributor guidelines
  • [ ] Did you write any new necessary tests?
  • [ ] Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • [ ] Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features
    • Added multi-model support for training, inference, checkpoints, and exports.
    • Sequence packing now adapts to data/pipeline parallelism with configurable bin counts for improved throughput.
    • Improved handling for virtual pipeline stages with per-stage data iterators.
  • Refactor
    • Internal model handling updated to operate on multiple sub-models consistently across workflows.
  • Chores
    • Updated Megatron-Bridge workspace submodule (no user-visible changes).

guyueh1 · Sep 15 '25 17:09

📝 Walkthrough

Updates include a submodule pointer bump, introduction of two optional sequence packing parameters, pipeline-parallel-aware binning in LM policy training, broad refactor of MegatronPolicyWorker to handle a list of models (multi-model/pipeline/virtual-pipeline aware), and a corresponding utility change to read modules from the first model in the list.

Changes

  • Submodule pointer update (3rdparty/Megatron-Bridge-workspace/Megatron-Bridge): Submodule reference updated from abd52c89... to 30f24c66.... No code or API changes in this repo.
  • Sequence packing args extension (nemo_rl/distributed/batched_data_dict.py): Adds optional min_bin_count and bin_count_multiple fields to SequencePackingArgs; shard_by_batch_size passes them to the bin packer, defaulting to the shard count when they are falsy.
  • PP-aware sequence packing config (nemo_rl/models/policy/lm_policy.py): Computes pp_size; when sequence packing is used, sets min_bin_count and bin_count_multiple to dp_size * pp_size (a sketch follows the first sequence diagram below). The dynamic batching path is unchanged.
  • Multi-model worker refactor and pipeline data flow (nemo_rl/models/policy/megatron_policy_worker.py): Treats self.model as a list; hooks, training, device moves, checkpointing, exports, parameter updates, and caches now iterate per sub-model. The data iterator becomes a list when virtual_pipeline_model_parallel_size > 1 (see the sketch after the second sequence diagram below). Configures P2P-communication overlap for multi-stage pipelines and adds rank-aware logging. No public API signature changes.
  • Utils aligned to model list (nemo_rl/models/policy/utils.py): get_gpu_info now inspects modules from model[0].named_modules() instead of model.named_modules(); the return structure is unchanged.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Trainer
  participant LMPolicy
  participant ParallelConfig as Parallel Config
  participant SeqPack as Sequence Packing Args
  participant BinPacker

  Trainer->>LMPolicy: start_training()
  LMPolicy->>ParallelConfig: get dp_size, pp_size
  ParallelConfig-->>LMPolicy: dp_size, pp_size
  LMPolicy->>SeqPack: set min_bin_count = dp_size*pp_size\nset bin_count_multiple = dp_size*pp_size
  LMPolicy->>BinPacker: construct(..., min_bin_count, bin_count_multiple)
  BinPacker-->>LMPolicy: ready
  LMPolicy-->>Trainer: proceed with training
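To make the first diagram concrete, here is a minimal, self-contained sketch of the PP-aware bin-count logic. The class and helper names below paraphrase the summaries above rather than the actual code, and the falsy-fallback behavior is taken from the walkthrough:

from dataclasses import dataclass
from typing import Optional

@dataclass
class SequencePackingArgs:
    # Optional fields introduced by this PR (per the changes list above).
    min_bin_count: Optional[int] = None
    bin_count_multiple: Optional[int] = None

def configure_packing(args: SequencePackingArgs, dp_size: int, pp_size: int) -> SequencePackingArgs:
    # lm_policy.py behavior: every one of the dp_size * pp_size pipeline
    # consumers needs at least one bin, so both knobs are set to the product.
    args.min_bin_count = dp_size * pp_size
    args.bin_count_multiple = dp_size * pp_size
    return args

def resolve_min_bins(args: SequencePackingArgs, shard_count: int) -> int:
    # shard_by_batch_size behavior: fall back to the shard count when the
    # new fields are unset/falsy (assumed from the walkthrough).
    return args.min_bin_count or shard_count

args = configure_packing(SequencePackingArgs(), dp_size=4, pp_size=2)
assert resolve_min_bins(args, shard_count=4) == 8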
sequenceDiagram
  autonumber
  actor Orchestrator
  participant Worker as MegatronPolicyWorker
  participant Models as [Model_0, Model_1, ...]
  participant DataIt as Data Iterator(s)
  participant Checkpoint as Checkpoint/Export

  Orchestrator->>Worker: initialize(models=list)
  Worker->>Models: for each: move to device, set hooks
  alt virtual_pipeline_model_parallel_size > 1
    Worker->>DataIt: create list of iterators (per VP stage)
  else
    Worker->>DataIt: create single iterator
  end
  loop training steps
    Worker->>Models: for each: zero grads, reset caches
    Worker->>Models: forward/backward (passes list to step fn)
  end
  Worker->>Checkpoint: save(model=list), export weights
  Worker-->>Orchestrator: done
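And a minimal sketch of the iterator branch shown above, assuming the interleaved schedule consumes one data iterator per model chunk (as Megatron-Core's forward_backward_func does under virtual pipelining); the helper name and signature are illustrative, not the worker's actual code:

from typing import Callable, Iterator, List, Optional, Union

def build_data_iterators(
    num_model_chunks: int,
    make_iter: Callable[[], Iterator],
    vpp_size: Optional[int],
) -> Union[Iterator, List[Iterator]]:
    # With virtual pipeline parallelism, the interleaved schedule expects
    # one data iterator per model chunk; otherwise a single iterator.
    if vpp_size is not None and vpp_size > 1:
        return [make_iter() for _ in range(num_model_chunks)]
    return make_iter()

# Example: two virtual stages on this pipeline rank -> a list of two iterators.
iters = build_data_iterators(2, lambda: iter(range(4)), vpp_size=2)
assert isinstance(iters, list) and len(iters) == 2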

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I thump with glee on parallel trails,
Bins now hop in DP×PP scales.
A warren of models, lined in a row,
Each takes a nibble, then onward we go.
Pack, iterate, checkpoint—done!
Carrots aligned, workloads run.
Happy hops under the training sun.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 64.29%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: The PR title 'perf: Support mcore virtual pipeline parallel' is only partially related to the changeset. Virtual pipeline parallel support is a significant aspect of the changes (particularly in megatron_policy_worker.py), but the 'perf:' prefix suggests a performance optimization, while the changes are primarily feature additions (multi-model support, new optional fields, pipeline-parallel awareness) rather than performance improvements. The title captures one aspect but undersells the broader scope of the changes.
✨ Finishing touches
  • [ ] 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • [ ] Create PR with unit tests
  • [ ] Post copyable unit tests in a comment


coderabbitai[bot] · Sep 15 '25 17:09

@yaoyu-33 @yfw can we get a review for this?

euronymous-aithal · Oct 31 '25 07:10

@parthmannan do you have a different PR that supports asymmetric VPP? If so, maybe we should close this and work on your PR, as it covers more cases.

guyueh1 · Nov 06 '25 18:11

@parthmannan do you have a different PR that supports asymmetric VPP? If so, maybe we should close this and work on your PR, as it covers more cases.

It would be a small change on top of this. I haven't opened it yet because I got pulled into something urgent. I'll try to open it, with your base VPP changes included, by tomorrow.

parthmannan · Nov 06 '25 19:11

@parthmannan let's merge all the VPP-related changes in one PR; I can drive this if you merge your changes into my branch (guyueh1/support_mcore_vpp).

guyueh1 · Nov 09 '25 21:11