
feat: Support mcore virtual pipeline parallel

Open · guyueh1 opened this pull request 3 months ago • 5 comments

What does this PR do?

Support virtual pipeline parallelism (VPP) in mcore.

Issues

closes #1038

Usage

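A minimal usage sketch, assuming Megatron-Core-style key names inside the policy's megatron_cfg; these keys are inferred from the files this PR touches and are assumptions, not confirmed NeMo-RL config API:

# Hypothetical sketch only: key names mirror Megatron-Core terminology and
# this PR's touched files; they are assumptions, not confirmed config API.
policy_config = {
    "megatron_cfg": {
        "tensor_model_parallel_size": 1,
        "pipeline_model_parallel_size": 4,  # layers split across 4 PP ranks
        # VPP: each PP rank holds multiple non-contiguous model chunks, so the
        # worker's self.model becomes a list of sub-models (see the walkthrough).
        "virtual_pipeline_model_parallel_size": 2,  # 2 chunks per PP rank
    }
}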

Before your PR is "Ready for review"

Pre checks:

  • [ ] Make sure you read and followed Contributor guidelines
  • [ ] Did you write any new necessary tests?
  • [ ] Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • [ ] Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features
    • Added multi-model support for training, inference, checkpoints, and exports.
    • Sequence packing now adapts to data/pipeline parallelism with configurable bin counts for improved throughput.
    • Improved handling for virtual pipeline stages with per-stage data iterators.
  • Refactor
    • Internal model handling updated to operate on multiple sub-models consistently across workflows.
  • Chores
    • Updated Megatron-Bridge workspace submodule (no user-visible changes).

guyueh1 · Sep 15 '25 17:09

📝 Walkthrough

Updates include a submodule pointer bump, introduction of two optional sequence packing parameters, pipeline-parallel-aware binning in LM policy training, broad refactor of MegatronPolicyWorker to handle a list of models (multi-model/pipeline/virtual-pipeline aware), and a corresponding utility change to read modules from the first model in the list.

Changes

  • Submodule pointer update (3rdparty/Megatron-Bridge-workspace/Megatron-Bridge): Submodule reference updated from abd52c89... to 30f24c66.... No code or API changes in this repo.
  • Sequence packing args extension (nemo_rl/distributed/batched_data_dict.py): Adds optional min_bin_count and bin_count_multiple fields to SequencePackingArgs; shard_by_batch_size passes them to the bin packer, defaulting to the shard count when they are falsy.
  • PP-aware sequence packing config (nemo_rl/models/policy/lm_policy.py): Computes pp_size; when sequence packing is used, sets min_bin_count and bin_count_multiple to dp_size * pp_size (a sketch follows the first sequence diagram below). The dynamic batching path is unchanged.
  • Multi-model worker refactor and pipeline data flow (nemo_rl/models/policy/megatron_policy_worker.py): Treats self.model as a list; hooks, training, device moves, checkpointing, exports, parameter updates, and caches now iterate per sub-model. The data iterator becomes a list when virtual_pipeline_model_parallel_size > 1 (see the sketch after the second sequence diagram below). Configures P2P-communication overlap for multi-stage pipelines and adds rank-aware logging. No public API signature changes.
  • Utils aligned to model list (nemo_rl/models/policy/utils.py): get_gpu_info now inspects modules from model[0].named_modules() instead of model.named_modules(); the return structure is unchanged.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Trainer
  participant LMPolicy
  participant ParallelConfig as Parallel Config
  participant SeqPack as Sequence Packing Args
  participant BinPacker

  Trainer->>LMPolicy: start_training()
  LMPolicy->>ParallelConfig: get dp_size, pp_size
  ParallelConfig-->>LMPolicy: dp_size, pp_size
  LMPolicy->>SeqPack: set min_bin_count = dp_size*pp_size\nset bin_count_multiple = dp_size*pp_size
  LMPolicy->>BinPacker: construct(..., min_bin_count, bin_count_multiple)
  BinPacker-->>LMPolicy: ready
  LMPolicy-->>Trainer: proceed with training
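To make the first diagram concrete, here is a minimal, self-contained sketch of the PP-aware bin-count logic. The class and helper names below paraphrase the summaries above rather than the actual code, and the falsy-fallback behavior is taken from the walkthrough:

from dataclasses import dataclass
from typing import Optional

@dataclass
class SequencePackingArgs:
    # Optional fields introduced by this PR (per the changes list above).
    min_bin_count: Optional[int] = None
    bin_count_multiple: Optional[int] = None

def configure_packing(args: SequencePackingArgs, dp_size: int, pp_size: int) -> SequencePackingArgs:
    # lm_policy.py behavior: every one of the dp_size * pp_size pipeline
    # consumers needs at least one bin, so both knobs are set to the product.
    args.min_bin_count = dp_size * pp_size
    args.bin_count_multiple = dp_size * pp_size
    return args

def resolve_min_bins(args: SequencePackingArgs, shard_count: int) -> int:
    # shard_by_batch_size behavior: fall back to the shard count when the
    # new fields are unset/falsy (assumed from the walkthrough).
    return args.min_bin_count or shard_count

args = configure_packing(SequencePackingArgs(), dp_size=4, pp_size=2)
assert resolve_min_bins(args, shard_count=4) == 8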
sequenceDiagram
  autonumber
  actor Orchestrator
  participant Worker as MegatronPolicyWorker
  participant Models as [Model_0, Model_1, ...]
  participant DataIt as Data Iterator(s)
  participant Checkpoint as Checkpoint/Export

  Orchestrator->>Worker: initialize(models=list)
  Worker->>Models: for each: move to device, set hooks
  alt virtual_pipeline_model_parallel_size > 1
    Worker->>DataIt: create list of iterators (per VP stage)
  else
    Worker->>DataIt: create single iterator
  end
  loop training steps
    Worker->>Models: for each: zero grads, reset caches
    Worker->>Models: forward/backward (passes list to step fn)
  end
  Worker->>Checkpoint: save(model=list), export weights
  Worker-->>Orchestrator: done
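And a minimal sketch of the iterator branch shown above, assuming the interleaved schedule consumes one data iterator per model chunk (as Megatron-Core's forward_backward_func does under virtual pipelining); the helper name and signature are illustrative, not the worker's actual code:

from typing import Callable, Iterator, List, Optional, Union

def build_data_iterators(
    num_model_chunks: int,
    make_iter: Callable[[], Iterator],
    vpp_size: Optional[int],
) -> Union[Iterator, List[Iterator]]:
    # With virtual pipeline parallelism, the interleaved schedule expects
    # one data iterator per model chunk; otherwise a single iterator.
    if vpp_size is not None and vpp_size > 1:
        return [make_iter() for _ in range(num_model_chunks)]
    return make_iter()

# Example: two virtual stages on this pipeline rank -> a list of two iterators.
iters = build_data_iterators(2, lambda: iter(range(4)), vpp_size=2)
assert isinstance(iters, list) and len(iters) == 2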

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I thump with glee on parallel trails,
Bins now hop in DP×PP scales.
A warren of models, lined in a row,
Each takes a nibble, then onward we go.
Pack, iterate, checkpoint—done!
Carrots aligned, workloads run.
Happy hops under the training sun.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 64.29%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: The PR title 'perf: Support mcore virtual pipeline parallel' is only partially related to the changeset. Virtual pipeline parallel support is a significant aspect of the changes (particularly in megatron_policy_worker.py), but the 'perf:' prefix suggests a performance optimization, while the changes are primarily feature additions (multi-model support, new optional fields, pipeline-parallel awareness) rather than performance improvements. The title captures one aspect but undersells the broader scope of the changes.
✨ Finishing touches
  • [ ] 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • [ ] Create PR with unit tests
  • [ ] Post copyable unit tests in a comment


coderabbitai[bot] · Sep 15 '25 17:09

@yaoyu-33 @yfw can we get a review for this?

euronymous-aithal · Oct 31 '25 07:10

@parthmannan do you have a different PR that supports asymmetric VPP? If so, maybe we should close this and work on your PR, as it covers more cases.

guyueh1 · Nov 06 '25 18:11

@parthmannan do you have a different PR that supports asymmetric VPP? If so, maybe we should close this and work on your PR, as it covers more cases.

It would be a small change on top of this. I haven't opened it yet because I got pulled into something urgent. I'll try to open it, with your base VPP changes included, by tomorrow.

parthmannan · Nov 06 '25 19:11

@parthmannan let's merge all the VPP-related changes in one PR; I can drive this if you merge your changes into my branch (guyueh1/support_mcore_vpp).

guyueh1 · Nov 09 '25 21:11