Add support for Apex distributed Adam optimizer with GPT-3
What does this PR do?
Adds support for training GPT-3 with the Apex implementation of the ZeRO optimizer.
Collection: NLP
Changelog
- Add option for `distributed_fused_adam` optimizer and integrate with GPT-3 training
Usage
Set the optimizer to `distributed_fused_adam` in the config file:
https://github.com/NVIDIA/NeMo/blob/23f6a95fb1479f5d2ff7111adbc5720ed3c58e48/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml#L122
~~Note that this optimizer is incompatible with the `megatron_amp_O2` option. It internally applies similar optimizations (bf16 params, wgrad kernel fusion), so refactoring may be warranted in the future.~~ This optimizer requires the `megatron_amp_O2` option.
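For reference, a minimal sketch of the relevant config fields. The field names and defaults below are assumed from the linked `megatron_gpt_config.yaml`; check that file for the full `optim` section (scheduler, etc.).

```yaml
# Sketch of the relevant fields in megatron_gpt_config.yaml
# (names/defaults assumed from the linked config, not a complete example)
model:
  megatron_amp_O2: True          # required by the distributed optimizer (see note above)
  optim:
    name: distributed_fused_adam # instead of the default fused_adam
    lr: 2e-4
    weight_decay: 0.01
    betas:
      - 0.9
      - 0.98
```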
Before your PR is "Ready for review"
Pre checks:
- [x] Make sure you read and followed Contributor guidelines
- [ ] Did you write any new necessary tests?
- [x] Did you add or update any necessary documentation?
- [x] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- [x] Reviewer: Does the PR have correct import guards for all optional libraries?
PR Type:
- [x] New Feature
- [ ] Bugfix
- [ ] Documentation
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed. The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information
- Depends on https://github.com/NVIDIA/apex/pull/1414
- Depends on https://github.com/NVIDIA/apex/pull/1432
- This is mostly orthogonal to https://github.com/NVIDIA/NeMo/pull/4380. It just requires changing this line to:
`if self.megatron_amp_o2 or self.with_distributed_adam:`
This pull request introduces 1 alert when merging 8a0e45d63049feb174555bcf48cd8571a3af008f into 468a3f3779f41f9c67bd8012e47a585a424dadf1 - view on LGTM.com
new alerts:
- 1 for Unused import
I've rebased to incorporate the sequence parallelism support from https://github.com/NVIDIA/NeMo/pull/4380. Pinging @ericharper.
I've made the distributed optimizer dependent on `megatron_amp_O2` instead of being mutually exclusive with it. I'm not convinced it simplifies the implementation so much as it shifts the messiness around, but it is definitely more intuitive for users. It also hints that this is an experimental feature.
This pull request introduces 1 alert when merging 1d1994eade758965da06f338f9b7451e76092d93 into f921ebe0436e55f7547b183ca83a623f6678422d - view on LGTM.com
new alerts:
- 1 for Unused import
This pull request introduces 1 alert when merging 2e359f5e384e34c576b4f1837e6d43a965bb9d11 into 4cd9b3449cbfedc671348fbabbe8e3a55fbd659d - view on LGTM.com
new alerts:
- 1 for Unused import
This pull request introduces 1 alert when merging 4e46c53ed417ac424dcc4751c2b4bbec50cf2604 into 38cfcd9aafbceed58f7bc58fd8fe371ce7eb3437 - view on LGTM.com
new alerts:
- 1 for Unused import