Add Mcore DistributedDataParallel and distributed optimizer into NeMo

Open gdengk opened this issue 10 months ago • 4 comments

What does this PR do ?

Add Mcore DistributedDataParallel and distributed optimizer into NeMo (examples/nlp/language_modeling/megatron_gpt_pretraining.py).

Changelog

  • Port Mcore DistributedDataParallel to NeMo.
  • Add a wrapper, McoreDistributedOptimizer, to bypass the torch/PTL assertion check.
  • Add the optimizer name mcore_distributed_optim, which turns on the Mcore distributed optimizer, along with a few other Mcore-related flags (details in the next section).
  • Verified memory usage and accuracy of the Mcore optimizer against the Apex optimizer.
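
The wrapper item above can be illustrated with a minimal, framework-free sketch. This is not the code from this PR: every class and function name below is a hypothetical stand-in, and the sketch assumes the check being bypassed is an `isinstance`-style type check on the optimizer. It only shows the general delegation pattern, where a thin subclass of the expected base type forwards calls to an optimizer that does not inherit from it.

```python
class FrameworkOptimizer:
    """Hypothetical stand-in for the optimizer base class a trainer expects."""

    def step(self):
        raise NotImplementedError

    def zero_grad(self):
        raise NotImplementedError


class ExternalOptimizer:
    """Hypothetical optimizer that does NOT inherit from FrameworkOptimizer."""

    def __init__(self):
        self.steps = 0

    def step(self):
        self.steps += 1

    def zero_grad(self):
        pass

    def state_dict(self):
        return {"steps": self.steps}


def trainer_accepts(optimizer):
    """Mimics a type check of the kind a trainer might assert on (assumption)."""
    return isinstance(optimizer, FrameworkOptimizer)


class WrappedOptimizer(FrameworkOptimizer):
    """Satisfies the type check and delegates all real work to the inner optimizer."""

    def __init__(self, inner):
        self.inner = inner

    def step(self):
        return self.inner.step()

    def zero_grad(self):
        return self.inner.zero_grad()

    def state_dict(self):
        return self.inner.state_dict()


external = ExternalOptimizer()
wrapped = WrappedOptimizer(external)
wrapped.step()  # forwarded to external.step()
```

Here `trainer_accepts(external)` is false while `trainer_accepts(wrapped)` is true, and the optimizer state still lives entirely in the inner object.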

Usage

  • Example configuration for the Mcore distributed optimizer:
optim:
    name: mcore_distributed_optim
    overlap_grad_sync: false 
    overlap_param_sync: false 
    grad_sync_dtype: fp32
    delay_param_gather: false
    delay_grad_reduce: true
    ddp_bucket_size: null
    check_for_nan_in_grad: false
    lr: 0.00012
    weight_decay: 0.1
    betas:
    - 0.9
    - 0.95
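
As a rough sketch, the same settings could also be passed as Hydra-style command-line overrides instead of editing the YAML file. The helper functions below are hypothetical and not part of NeMo, and the `model.optim` prefix is an assumption about where this block sits in the training config. The snippet only flattens a dict that mirrors the example above into `key=value` strings.

```python
def format_value(value):
    """Render a Python value in Hydra override syntax."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (list, tuple)):
        return "[" + ",".join(format_value(v) for v in value) + "]"
    return str(value)


def to_overrides(cfg, prefix):
    """Flatten a (possibly nested) dict into dotted `key=value` override strings."""
    overrides = []
    for key, value in cfg.items():
        path = f"{prefix}.{key}"
        if isinstance(value, dict):
            overrides.extend(to_overrides(value, path))
        else:
            overrides.append(f"{path}={format_value(value)}")
    return overrides


# Mirrors the optim block shown above.
optim_cfg = {
    "name": "mcore_distributed_optim",
    "overlap_grad_sync": False,
    "overlap_param_sync": False,
    "grad_sync_dtype": "fp32",
    "delay_param_gather": False,
    "delay_grad_reduce": True,
    "ddp_bucket_size": None,
    "check_for_nan_in_grad": False,
    "lr": 0.00012,
    "weight_decay": 0.1,
    "betas": [0.9, 0.95],
}

# "model.optim" is an assumed config path, not confirmed by this PR.
overrides = to_overrides(optim_cfg, "model.optim")
print(" ".join(overrides))
```

The resulting strings could then be appended to the launch command for the pretraining script, provided the config path assumption holds for the config in use.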

Jenkins CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

There's no need to comment jenkins on the PR to trigger Jenkins CI. The GitHub Actions CI will run automatically when the PR is opened. To run CI on an untrusted fork, a NeMo user with write access must click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • [ ] Make sure you read and followed Contributor guidelines
  • [ ] Did you write any new necessary tests?
  • [ ] Did you add or update any necessary documentation?
  • [ ] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • [ ] Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • [x] New Feature
  • [ ] Bugfix
  • [ ] Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed. The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

gdengk avatar Apr 24 '24 20:04 gdengk

  • EP=2 NGPU=8
  • EP=1 TP=1 NGPU=8

These also work correctly.

akoumpa avatar Apr 26 '24 04:04 akoumpa

@ericharper Can you trigger test?

erhoo82 avatar Apr 26 '24 21:04 erhoo82

The tests are triggered; you only need to add the "Run CICD" label.

ericharper avatar Apr 26 '24 21:04 ericharper

@ericharper will the tests be re-triggered automatically after I made new changes with the CICD label? It looks like it still needs to be manually re-triggered.

gdengk avatar Apr 26 '24 23:04 gdengk