Add support for Apex distributed Adam optimizer with GPT-3
What does this PR do?
Adds support for training GPT-3 with the Apex implementation of the ZeRO optimizer.
Collection: NLP
Changelog
- Add option for `distributed_fused_adam` optimizer and integrate with GPT-3 training
Usage
Set the optimizer to `distributed_fused_adam` in the config file:
https://github.com/NVIDIA/NeMo/blob/23f6a95fb1479f5d2ff7111adbc5720ed3c58e48/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml#L122
~~Note that this optimizer is incompatible with the `megatron_amp_O2` option. It internally applies similar optimizations (bf16 params, wgrad kernel fusion), so refactoring may be warranted in the future.~~ This optimizer requires the `megatron_amp_O2` option.
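For reference, a minimal sketch of the relevant config fields. The field names and defaults below are assumed from the linked `megatron_gpt_config.yaml`; check that file for the full `optim` section (scheduler, etc.).

```yaml
# Sketch of the relevant fields in megatron_gpt_config.yaml
# (names/defaults assumed from the linked config, not a complete example)
model:
  megatron_amp_O2: True          # required by the distributed optimizer (see note above)
  optim:
    name: distributed_fused_adam # instead of the default fused_adam
    lr: 2e-4
    weight_decay: 0.01
    betas:
      - 0.9
      - 0.98
```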
Before your PR is "Ready for review"
Pre checks:
- [x] Make sure you read and followed Contributor guidelines
- [ ] Did you write any new necessary tests?
- [x] Did you add or update any necessary documentation?
- [x] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- [x] Reviewer: Does the PR have correct import guards for all optional libraries?
PR Type:
- [x] New Feature
- [ ] Bugfix
- [ ] Documentation
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed. The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information
- Depends on https://github.com/NVIDIA/apex/pull/1414
- Depends on https://github.com/NVIDIA/apex/pull/1432
- This is mostly orthogonal to https://github.com/NVIDIA/NeMo/pull/4380. It just requires changing this line to:
`if self.megatron_amp_o2 or self.with_distributed_adam:`
This pull request introduces 1 alert when merging 8a0e45d63049feb174555bcf48cd8571a3af008f into 468a3f3779f41f9c67bd8012e47a585a424dadf1 - view on LGTM.com
new alerts:
- 1 for Unused import
I've rebased to incorporate the sequence parallelism support from https://github.com/NVIDIA/NeMo/pull/4380. Pinging @ericharper.
I've made the distributed optimizer dependent on `megatron_amp_O2` instead of being mutually exclusive with it. I'm not convinced it simplifies the implementation so much as it shifts the messiness around, but it is definitely more intuitive for users. It also hints that this is an experimental feature.
This pull request introduces 1 alert when merging 1d1994eade758965da06f338f9b7451e76092d93 into f921ebe0436e55f7547b183ca83a623f6678422d - view on LGTM.com
new alerts:
- 1 for Unused import
This pull request introduces 1 alert when merging 2e359f5e384e34c576b4f1837e6d43a965bb9d11 into 4cd9b3449cbfedc671348fbabbe8e3a55fbd659d - view on LGTM.com
new alerts:
- 1 for Unused import
This pull request introduces 1 alert when merging 4e46c53ed417ac424dcc4751c2b4bbec50cf2604 into 38cfcd9aafbceed58f7bc58fd8fe371ce7eb3437 - view on LGTM.com
new alerts:
- 1 for Unused import