
Support distributed Adam with T5 and support overlapped grad reductions with pipeline parallelism

Open · timmoon10 opened this pull request 3 years ago • 6 comments

What does this PR do ?

Generalizes distributed Adam support from GPT-3 to T5 and other Megatron-LM models, and implements several performance optimizations.

Collection: NLP

Changelog

  • When params are BF16, distributed Adam stores 16-bit param remainders instead of FP32 main params (see the sketch after this list)
  • Decouple distributed Adam support from Megatron O2-level optimizations
  • Add support for the Apex distributed Adam optimizer with other Megatron-LM models, namely T5
  • Add support for overlapped grad reductions with pipeline or sequence parallelism
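
For context on the first item: a BF16 value shares FP32's sign and exponent bits and is effectively the top 16 bits of the FP32 word, so keeping the low 16 bits (the "remainder") is enough to rebuild the FP32 master value exactly. A minimal pure-Python sketch of the idea, illustrative only and not the Apex implementation:

```python
import struct

def split_fp32(x32: float):
    """Split an FP32 value into its truncated-BF16 half and a 16-bit remainder."""
    bits = struct.unpack("<I", struct.pack("<f", x32))[0]  # FP32 bit pattern as an int
    return bits >> 16, bits & 0xFFFF                       # high half (BF16 bits), low half (remainder)

def merge_fp32(bf16_bits: int, remainder: int) -> float:
    """Reassemble the exact FP32 value from the two 16-bit halves."""
    return struct.unpack("<f", struct.pack("<I", (bf16_bits << 16) | remainder))[0]

x32 = struct.unpack("<f", struct.pack("<f", 0.1234567))[0]  # an exactly representable FP32 value
hi, lo = split_fp32(x32)
assert merge_fp32(hi, lo) == x32  # lossless round trip
```

Since the model params are already kept in BF16, only the low halves need extra optimizer storage (2 bytes per param instead of 4 for FP32 main params).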

Usage

Set the optimizer to distributed_fused_adam in the config file:

https://github.com/NVIDIA/NeMo/blob/265f7b17876aef7937cf0fedf7a76885bdc63288/examples/nlp/language_modeling/conf/megatron_t5_config.yaml#L137
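
As a minimal sketch of the same switch done programmatically (assuming the usual `model.optim.name` layout of the NeMo Megatron configs; the equivalent Hydra command-line override would be `model.optim.name=distributed_fused_adam`):

```python
# Sketch only: point the T5 pretraining recipe at the distributed optimizer.
from omegaconf import OmegaConf

cfg = OmegaConf.load("examples/nlp/language_modeling/conf/megatron_t5_config.yaml")
cfg.model.optim.name = "distributed_fused_adam"  # replaces the default fused Adam optimizer
```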

Before your PR is "Ready for review"

Pre checks:

  • [x] Make sure you read and followed Contributor guidelines
  • [ ] Did you write any new necessary tests?
  • [ ] Did you add or update any necessary documentation?
  • [x] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex, etc.)
    • [ ] Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • [x] New Feature
  • [ ] Bugfix
  • [ ] Documentation

If you haven't finished some of the above items you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed. The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Closes https://github.com/NVIDIA/NeMo/pull/4799
  • Depends on https://github.com/NVIDIA/apex/pull/1475

timmoon10 commented Sep 07 '22

This pull request introduces 1 alert and fixes 1 when merging 065a89b686c1d2a494ec0ec19559fb221cd55835 into abbe6430e314a0159370e198f16b75dcd75ba3f7 - view on LGTM.com

new alerts:

  • 1 for Unused import

fixed alerts:

  • 1 for Unused import

lgtm-com[bot] commented Sep 07 '22

This pull request introduces 1 alert and fixes 1 when merging 7943ebcee11d1b65936477cc0078238069fe7e96 into b9cf05cf76496b57867d39308028c60fef7cb1ba - view on LGTM.com

new alerts:

  • 1 for Unused import

fixed alerts:

  • 1 for Unused import

lgtm-com[bot] commented Sep 08 '22

This pull request fixes 1 alert when merging e06d34a0919aa624b7782cf0b7c43f68ebee9d6f into b9cf05cf76496b57867d39308028c60fef7cb1ba - view on LGTM.com

fixed alerts:

  • 1 for Unused import

lgtm-com[bot] commented Sep 09 '22

Running T5 41B on 32 Selene nodes, I see a 1.2x speedup over the pure data-parallel implementation, 66% of the expected memory savings, and nearly identical loss values after 20 steps.

Full results with T5 41B and GPT-3 175B. The run configurations are detailed inside. Note that I ran with a relatively small global batch size, which makes communication a larger fraction of the runtime.
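
For reference, the expected savings come from sharding the Adam state (two FP32 moments plus the main-param copy) across the data-parallel group, with this PR additionally shrinking the main-param copy to 16 bits for BF16 models. A rough back-of-envelope sketch with purely hypothetical sizes, not the run configuration above:

```python
# Illustrative arithmetic only: n_params is the parameter count owned by one
# model-parallel shard, dp_size is a hypothetical data-parallel group size.
def adam_state_gib(n_params, dp_size=1, bytes_master=4, bytes_moment=4):
    per_param = 2 * bytes_moment + bytes_master  # exp_avg + exp_avg_sq + main-param copy
    return n_params * per_param / dp_size / 2**30

n = 5e9  # hypothetical per-shard parameter count
print(adam_state_gib(n))                             # replicated state, FP32 main params
print(adam_state_gib(n, dp_size=8, bytes_master=2))  # sharded state, 16-bit param remainders
```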

timmoon10 commented Sep 10 '22

This pull request fixes 1 alert when merging d528a89eb06405d32b8b92e82ba530eed988a762 into f1825bc4b724b78c2d6ca392b616e8dc9a8cde04 - view on LGTM.com

fixed alerts:

  • 1 for Unused import

lgtm-com[bot] commented Sep 15 '22

This pull request fixes 1 alert when merging 811b59cf0b2ec14fd65d7cc79193808c34b24231 into f1825bc4b724b78c2d6ca392b616e8dc9a8cde04 - view on LGTM.com

fixed alerts:

  • 1 for Unused import

lgtm-com[bot] commented Sep 20 '22

This pull request fixes 1 alert when merging ebd98c405210d7932a2bb9698dd2d1762d9a46ba into 971485ce7fedd7a6d16966f00583fe2736129f52 - view on LGTM.com

fixed alerts:

  • 1 for Unused import

lgtm-com[bot] commented Sep 27 '22

This pull request fixes 1 alert when merging b2a61adab0ae6b337a1da4e36899c15753cc8d81 into 73fcfd7cdf69f375d6398a2930f9d1c03c96db39 - view on LGTM.com

fixed alerts:

  • 1 for Unused import

lgtm-com[bot] commented Sep 27 '22

This pull request fixes 1 alert when merging 7c3551e361e9d410a65f204d80bc92c8b69f98ea into e0cc6b76766f6f1f933be5f798d86fb1a5edd3ee - view on LGTM.com

fixed alerts:

  • 1 for Unused import

lgtm-com[bot] commented Oct 04 '22

This pull request fixes 1 alert when merging 39b3a88bc71fc2a60908db6f59ad0c0eeffc7b56 into 3fda5de46d9e5e3a55c18449a88665f80c34899f - view on LGTM.com

fixed alerts:

  • 1 for Unused import

lgtm-com[bot] commented Oct 11 '22

I've fixed a bug related to gradient scaling that only affected FP16 training (see https://github.com/NVIDIA/apex/pull/1512). BF16 training shouldn't be affected.
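
For context, gradient scaling only matters for FP16 because its narrow exponent range flushes small gradients to zero, while BF16 keeps FP32's exponent range. A quick illustration, unrelated to the Apex fix itself:

```python
import torch

# A gradient magnitude below FP16's smallest subnormal (~6e-8) underflows to
# zero, which is why FP16 training needs loss/grad scaling; BF16 shares FP32's
# exponent range, so the same value survives unscaled.
g = 1e-8
print(torch.tensor(g, dtype=torch.float16))   # 0.0 (underflow)
print(torch.tensor(g, dtype=torch.bfloat16))  # ~1e-08
```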

This PR is ready to merge. Pinging @erhoo82.

timmoon10 commented Oct 14 '22

This pull request fixes 1 alert when merging c3692af24285a51270bdaa0be2fcce712bfddf22 into 18940b3b32cff290cf70d4a251b0e2f7b08e1525 - view on LGTM.com

fixed alerts:

  • 1 for Unused import

lgtm-com[bot] commented Oct 14 '22

This pull request fixes 1 alert when merging 263897410cdbfc84eece469e349b8ba7341a1753 into aa15a99ca793b5131c9e737f001d09c9c498c013 - view on LGTM.com

fixed alerts:

  • 1 for Unused import

lgtm-com[bot] commented Oct 17 '22

This pull request fixes 1 alert when merging a0883042ce051583f619eb4f46dc474f524d8636 into 86565741f22d0c19da07680e2f40bc207c613206 - view on LGTM.com

fixed alerts:

  • 1 for Unused import

lgtm-com[bot] commented Oct 19 '22

Running on a DGX A100 node for 50 steps with 2-way data, tensor, and pipeline parallelism, I see nearly identical learning behavior with and without the distributed optimizer:

| Model | ZeRO | O2 | Data type | Throughput | Train loss | Val loss |
|---|---|---|---|---|---|---|
| GPT-2 124M | Yes | No | FP16 | 2.69 it/s | 8.4 | 7.870 |
| GPT-2 124M | No | No | FP16 | 3.31 it/s | 8.4 | 7.870 |
| GPT-2 124M | Yes | No | BF16 | 3.19 it/s | 8.28 | 7.790 |
| GPT-2 124M | No | No | BF16 | 3.39 it/s | 8.28 | 7.790 |
| GPT-2 124M | Yes | Yes | BF16 | 3.44 it/s | 8.3 | 7.800 |
| GPT-2 124M | No | Yes | BF16 | 3.60 it/s | 8.3 | 7.800 |
| T5 220M | Yes | No | FP32 | 1.46 it/s | 7.64 | 7.530 |
| T5 220M | No | No | FP32 | 1.69 it/s | 7.64 | 7.530 |
| T5 220M | Yes | No | FP16 | 1.43 it/s | 8.45 | 8.290 |
| T5 220M | No | No | FP16 | 1.43 it/s | 8.45 | 8.290 |
| T5 220M | Yes | No | BF16 | 1.50 it/s | 7.66 | 7.560 |
| T5 220M | No | No | BF16 | 1.45 it/s | 7.65 | 7.550 |
| T5 220M | Yes | Yes | BF16 | 1.58 it/s | 7.65 | 7.540 |
| T5 220M | No | Yes | BF16 | 1.61 it/s | 7.65 | 7.540 |

I get runtime failures when I run GPT-2 with FP32 and with pipeline parallelism enabled. This error shows up in the main branch as well.

timmoon10 commented Oct 20 '22

This pull request fixes 1 alert when merging aed0e00b1fe5887479e0d1ff191cc249eb89295b into 85fc659e956547f8b23f735badf081eea3f7dbb3 - view on LGTM.com

fixed alerts:

  • 1 for Unused import

lgtm-com[bot] commented Oct 20 '22

This pull request fixes 1 alert when merging 190f9921d00c9ea8b57b1208421910eb3844cad0 into 033600050f1d0e39fde7a807267a55e56092aa94 - view on LGTM.com

fixed alerts:

  • 1 for Unused import

lgtm-com[bot] commented Oct 20 '22

With https://github.com/NVIDIA/apex/pull/1514 the distributed optimizer supports interleaved pipeline parallelism. Running GPT-2 124M for 20 steps, I get the same loss values with and without the distributed optimizer.
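
For completeness, a sketch of how an interleaved-pipeline run with the distributed optimizer would be configured; the field names below (pipeline_model_parallel_size, virtual_pipeline_model_parallel_size) follow the usual Megatron-style naming and are an assumption, not copied from this PR:

```python
# Sketch only: hypothetical overrides for interleaved pipeline parallelism
# with the distributed optimizer.
from omegaconf import OmegaConf

cfg = OmegaConf.load("examples/nlp/language_modeling/conf/megatron_gpt_config.yaml")
cfg.model.optim.name = "distributed_fused_adam"
cfg.model.pipeline_model_parallel_size = 2
cfg.model.virtual_pipeline_model_parallel_size = 2  # >1 interleaves the pipeline stages
```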

timmoon10 commented Oct 21 '22