Tim Moon

80 comments by Tim Moon

I've fixed a bug related to gradient scaling that only affected FP16 training (see https://github.com/NVIDIA/apex/pull/1512). BF16 training shouldn't be affected. This PR is ready to merge. Pinging @erhoo82.
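For context, loss/grad scaling is only needed because of FP16's narrow dynamic range; BF16 runs typically disable it, which is why they are unaffected. A minimal sketch with the stock `torch.cuda.amp.GradScaler` (illustrative only, not the Apex/Megatron code path touched by the PR):

```python
import torch

model = torch.nn.Linear(16, 16).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

use_fp16 = True  # illustrative switch; a BF16 run would set this to False
dtype = torch.float16 if use_fp16 else torch.bfloat16
scaler = torch.cuda.amp.GradScaler(enabled=use_fp16)  # becomes a no-op for BF16

x = torch.randn(8, 16, device="cuda")
with torch.autocast("cuda", dtype=dtype):
    loss = model(x).square().mean()

scaler.scale(loss).backward()  # scales the loss (only when enabled)
scaler.step(optimizer)         # unscales grads, skips the step on inf/NaN
scaler.update()                # grows/shrinks the scale factor
optimizer.zero_grad()
```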

Running on a DGX A100 node for 50 steps with 2-way data, tensor, and pipeline parallelism, I see nearly identical learning behavior with and without the distributed optimizer:

| Model...

With https://github.com/NVIDIA/apex/pull/1514 the distributed optimizer supports interleaved pipeline parallelism. Running GPT-2 124M for 20 steps, I get the same loss values with and without the distributed optimizer.
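For anyone trying to reproduce these comparisons, a hedged sketch of the drop-in usage, assuming the optimizer under test is `DistributedFusedAdam` from `apex.contrib.optimizers.distributed_fused_adam` and showing only standard Adam-style arguments:

```python
import torch
from apex.contrib.optimizers.distributed_fused_adam import DistributedFusedAdam

torch.distributed.init_process_group(backend="nccl")
torch.cuda.set_device(torch.distributed.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(1024, 1024).cuda()

# Drop-in replacement for Adam: optimizer state (FP32 master weights and the
# Adam moments) is sharded across data-parallel ranks instead of replicated.
optimizer = DistributedFusedAdam(model.parameters(), lr=1e-4, betas=(0.9, 0.95))

loss = model(torch.randn(8, 1024, device="cuda")).square().mean()
loss.backward()
optimizer.step()       # sharded Adam update, then parameters are re-synchronized
optimizer.zero_grad()
```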

Interesting, incorporating the distributed optimizer into that is a good idea. I'm using the [NeMo grad scaler](https://github.com/NVIDIA/NeMo/blob/18940b3b32cff290cf70d4a251b0e2f7b08e1525/nemo/collections/nlp/parts/nlp_overrides.py#L395), which seems to be an extension of the Apex version.

I've updated the Apex grad scaler to match the changes I've made in the NeMo grad scaler.
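The gist of that kind of extension, sketched against the stock `torch.cuda.amp.GradScaler` (illustrative only, relying on private internals; see the linked NeMo source for the real implementation): the inf/NaN decision is all-reduced so every rank agrees on whether to skip the step.

```python
import torch

class ModelParallelGradScaler(torch.cuda.amp.GradScaler):
    """Illustrative sketch: synchronize the found-inf check across ranks."""

    def _maybe_opt_step(self, optimizer, optimizer_state, *args, **kwargs):
        # Count inf/NaN grads found on this rank...
        found_inf = sum(v.item() for v in optimizer_state["found_inf_per_device"].values())
        found_inf = torch.tensor([float(found_inf)], device="cuda")
        # ...and agree on the result across all ranks, so that either every
        # rank steps or every rank skips.
        if torch.distributed.is_available() and torch.distributed.is_initialized():
            torch.distributed.all_reduce(found_inf)
        if found_inf.item() == 0:
            return optimizer.step(*args, **kwargs)
        return None
```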

For reference, most frameworks treat attention mask `True`s as inclusion:

| Implementation | `True` in attention mask implies inclusion |
|----|----|
| [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html) | No |
| [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html#torch.nn.functional.scaled_dot_product_attention) | Yes...
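A quick way to see the difference between the first two rows (a generic sketch, not tied to any code discussed above): mask out the last key position under both APIs, which requires opposite boolean conventions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = k = v = torch.randn(1, 4, 8)                # (batch, seq, embed)
keep = torch.tensor([True, True, True, False])  # attend to the first 3 keys only

# F.scaled_dot_product_attention: True means "include this position".
out_sdpa = F.scaled_dot_product_attention(
    q.unsqueeze(1), k.unsqueeze(1), v.unsqueeze(1),  # add a head dim
    attn_mask=keep.view(1, 1, 1, 4),                 # broadcast over batch/head/query
)

# nn.MultiheadAttention: True in key_padding_mask means "ignore this position",
# so expressing the same intent requires the inverted mask.
mha = torch.nn.MultiheadAttention(embed_dim=8, num_heads=1, batch_first=True)
out_mha, _ = mha(q, k, v, key_padding_mask=~keep.unsqueeze(0))
```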