Tim Moon

80 comments by Tim Moon

I've fixed a bug related to gradient scaling that only affected FP16 training (see https://github.com/NVIDIA/apex/pull/1512). BF16 training shouldn't be affected. This PR is ready to merge. Pinging @erhoo82.
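For context, loss/grad scaling is only needed because of FP16's narrow dynamic range; BF16 runs typically disable it, which is why they are unaffected. A minimal sketch with the stock `torch.cuda.amp.GradScaler` (illustrative only, not the Apex/Megatron code path touched by the PR):

```python
import torch

model = torch.nn.Linear(16, 16).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

use_fp16 = True  # illustrative switch; a BF16 run would set this to False
dtype = torch.float16 if use_fp16 else torch.bfloat16
scaler = torch.cuda.amp.GradScaler(enabled=use_fp16)  # becomes a no-op for BF16

x = torch.randn(8, 16, device="cuda")
with torch.autocast("cuda", dtype=dtype):
    loss = model(x).square().mean()

scaler.scale(loss).backward()  # scales the loss (only when enabled)
scaler.step(optimizer)         # unscales grads, skips the step on inf/NaN
scaler.update()                # grows/shrinks the scale factor
optimizer.zero_grad()
```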

Running on a DGX A100 node for 50 steps with 2-way data, tensor, and pipeline parallelism, I see nearly identical learning behavior with and without the distributed optimizer:

| Model...

With https://github.com/NVIDIA/apex/pull/1514 the distributed optimizer supports interleaved pipeline parallelism. Running GPT-2 124M for 20 steps, I get the same loss values with and without the distributed optimizer.
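For anyone trying to reproduce these comparisons, a hedged sketch of the drop-in usage, assuming the optimizer under test is `DistributedFusedAdam` from `apex.contrib.optimizers.distributed_fused_adam` and showing only standard Adam-style arguments:

```python
import torch
from apex.contrib.optimizers.distributed_fused_adam import DistributedFusedAdam

torch.distributed.init_process_group(backend="nccl")
torch.cuda.set_device(torch.distributed.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(1024, 1024).cuda()

# Drop-in replacement for Adam: optimizer state (FP32 master weights and the
# Adam moments) is sharded across data-parallel ranks instead of replicated.
optimizer = DistributedFusedAdam(model.parameters(), lr=1e-4, betas=(0.9, 0.95))

loss = model(torch.randn(8, 1024, device="cuda")).square().mean()
loss.backward()
optimizer.step()       # sharded Adam update, then parameters are re-synchronized
optimizer.zero_grad()
```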

Interesting, incorporating the distributed optimizer into that is a good idea. I'm using the [NeMo grad scaler](https://github.com/NVIDIA/NeMo/blob/18940b3b32cff290cf70d4a251b0e2f7b08e1525/nemo/collections/nlp/parts/nlp_overrides.py#L395), which seems to be an extension of the Apex version.

I've updated the Apex grad scaler to match the changes I've made in the NeMo grad scaler.
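The gist of that kind of extension, sketched against the stock `torch.cuda.amp.GradScaler` (illustrative only, relying on private internals; see the linked NeMo source for the real implementation): the inf/NaN decision is all-reduced so every rank agrees on whether to skip the step.

```python
import torch

class ModelParallelGradScaler(torch.cuda.amp.GradScaler):
    """Illustrative sketch: synchronize the found-inf check across ranks."""

    def _maybe_opt_step(self, optimizer, optimizer_state, *args, **kwargs):
        # Count inf/NaN grads found on this rank...
        found_inf = sum(v.item() for v in optimizer_state["found_inf_per_device"].values())
        found_inf = torch.tensor([float(found_inf)], device="cuda")
        # ...and agree on the result across all ranks, so that either every
        # rank steps or every rank skips.
        if torch.distributed.is_available() and torch.distributed.is_initialized():
            torch.distributed.all_reduce(found_inf)
        if found_inf.item() == 0:
            return optimizer.step(*args, **kwargs)
        return None
```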

For reference, most frameworks treat attention mask `True`s as inclusion:

| Implementation | `True` in attention mask implies inclusion |
|----|----|
| [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html) | No |
| [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html#torch.nn.functional.scaled_dot_product_attention) | Yes...
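A quick way to see the difference between the first two rows (a generic sketch, not tied to any code discussed above): mask out the last key position under both APIs, which requires opposite boolean conventions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = k = v = torch.randn(1, 4, 8)                # (batch, seq, embed)
keep = torch.tensor([True, True, True, False])  # attend to the first 3 keys only

# F.scaled_dot_product_attention: True means "include this position".
out_sdpa = F.scaled_dot_product_attention(
    q.unsqueeze(1), k.unsqueeze(1), v.unsqueeze(1),  # add a head dim
    attn_mask=keep.view(1, 1, 1, 4),                 # broadcast over batch/head/query
)

# nn.MultiheadAttention: True in key_padding_mask means "ignore this position",
# so expressing the same intent requires the inverted mask.
mha = torch.nn.MultiheadAttention(embed_dim=8, num_heads=1, batch_first=True)
out_mha, _ = mha(q, k, v, key_padding_mask=~keep.unsqueeze(0))
```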