NeMo
Add Mcore DistributedDataParallel and distributed optimizer into Nemo
What does this PR do?
Adds Mcore DistributedDataParallel and the Mcore distributed optimizer to NeMo (examples/nlp/language_modeling/megatron_gpt_pretraining.py).
Changelog
- Port DistributedDataParallel to NeMo
- Add a wrapper McoreDistributedOptimizer to bypass torch/PTL assertion check
- Add the optim name `mcore_distributed_optim` to turn on the Mcore distributed optimizer and a few other Mcore-related flags (details in the next section).
- Verified memory and accuracy between the Mcore and Apex optimizers.
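The wrapper item above can be illustrated with a generic delegation pattern. This is only a sketch of the idea, not the code in this PR: `FrameworkOptimizer` is a hypothetical stand-in for the base class a framework type-checks against (for example `torch.optim.Optimizer`), and `ThirdPartyOptimizer` is a hypothetical stand-in for an optimizer that does not inherit from it.

```python
class FrameworkOptimizer:
    """Hypothetical stand-in for the base class the framework asserts on."""

    def step(self, *args, **kwargs):
        raise NotImplementedError


class ThirdPartyOptimizer:
    """Hypothetical stand-in for an optimizer with an unrelated class hierarchy."""

    def __init__(self):
        self.steps = 0

    def step(self):
        self.steps += 1

    def state_dict(self):
        return {"steps": self.steps}


class OptimizerWrapper(FrameworkOptimizer):
    """Passes isinstance checks while delegating all work to the wrapped optimizer."""

    def __init__(self, inner):
        self.inner = inner

    def step(self, *args, **kwargs):
        # Explicit delegation for the method the base class declares.
        return self.inner.step(*args, **kwargs)

    def __getattr__(self, name):
        # Called only when normal lookup fails; forward to the wrapped optimizer.
        return getattr(self.__dict__["inner"], name)


inner = ThirdPartyOptimizer()
wrapped = OptimizerWrapper(inner)

# The framework-side assertion now passes, and calls reach the inner optimizer.
assert isinstance(wrapped, FrameworkOptimizer)
wrapped.step()
```

Delegation keeps the wrapped optimizer's behavior untouched; the wrapper exists only so that the type check succeeds.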
Usage
- Mcore distributed optimizer usage example is as below:
optim:
  name: mcore_distributed_optim
  overlap_grad_sync: false
  overlap_param_sync: false
  grad_sync_dtype: fp32
  delay_param_gather: false
  delay_grad_reduce: true
  ddp_bucket_size: null
  check_for_nan_in_grad: false
  lr: 0.00012
  weight_decay: 0.1
  betas:
  - 0.9
  - 0.95
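For quick experiments, the same settings could be passed as command-line overrides instead of editing the YAML. The sketch below turns the values from the example into Hydra-style `key=value` strings. The `model.optim` prefix is an assumption about where the `optim` block sits in the training config, and `build_overrides` and `to_override_value` are hypothetical helpers written for illustration, not part of NeMo.

```python
from typing import Any, Dict, List

# Values copied from the usage example above.
OPTIM_CONFIG: Dict[str, Any] = {
    "name": "mcore_distributed_optim",
    "overlap_grad_sync": False,
    "overlap_param_sync": False,
    "grad_sync_dtype": "fp32",
    "delay_param_gather": False,
    "delay_grad_reduce": True,
    "ddp_bucket_size": None,
    "check_for_nan_in_grad": False,
    "lr": 0.00012,
    "weight_decay": 0.1,
    "betas": [0.9, 0.95],
}


def to_override_value(value: Any) -> str:
    """Render a Python value using YAML-style literals (null, true, false, [a,b])."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # must be checked before numeric handling
        return "true" if value else "false"
    if isinstance(value, (list, tuple)):
        return "[" + ",".join(to_override_value(v) for v in value) + "]"
    return str(value)


def build_overrides(config: Dict[str, Any], prefix: str = "model.optim") -> List[str]:
    """Build one `prefix.key=value` string per config entry (prefix is an assumption)."""
    return [f"{prefix}.{key}={to_override_value(val)}" for key, val in config.items()]


overrides = build_overrides(OPTIM_CONFIG)

if __name__ == "__main__":
    # Print one override per line so they can be appended to a launch command.
    print("\n".join(overrides))
```

If the `optim` block lives under a different key in a given config, only the `prefix` argument needs to change.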
Jenkins CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
There's no need to comment `jenkins` on the PR to trigger Jenkins CI.
The GitHub Actions CI will run automatically when the PR is opened.
To run CI on an untrusted fork, a NeMo user with write access must click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
- [ ] Make sure you read and followed Contributor guidelines
- [ ] Did you write any new necessary tests?
- [ ] Did you add or update any necessary documentation?
- [ ] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- [ ] Reviewer: Does the PR have correct import guards for all optional libraries?
PR Type:
- [x] New Feature
- [ ] Bugfix
- [ ] Documentation
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed. The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information
- Related to # (issue)
- EP=2 NGPU=8
- EP=1 TP=1 NGPU=8 These also work correctly
@ericharper Can you trigger test?
the tests are triggered, only need to add the "Run CICD" label
@ericharper will the tests be re-triggered automatically after I made new changes with the CICD label? It looks like it still needs to be manually re-triggered.