TransformerEngine
Need all-reduce for norm weight gradients with sequence parallel
https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/module/layernorm_linear.py#L461-L471
When we use sequence parallelism, do we need to all-reduce the norm weight gradients across TP groups after the code above?
Sorry for the late reply. Yes, that's correct. Currently we expect that this all-reduce happens outside of TE, which allows us to coalesce multiple all-reduces into a single NCCL call.
Megatron-LM: https://github.com/NVIDIA/Megatron-LM/blob/52f13005148afa47a6f37b082083fa2c6675ae3e/megatron/optimizer/optimizer.py#L244
NeMo: https://github.com/NVIDIA/NeMo/blob/a9fb58bcee7bff0e50a621d05ec3a9b5eb5f584c/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py#L696
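For illustration, here is a minimal sketch (not TE's or Megatron's exact code) of what such an external all-reduce can look like: gather every norm weight/bias gradient produced under sequence parallelism, flatten them into one buffer, and issue a single NCCL all-reduce over the tensor-parallel group. The `sequence_parallel` attribute and the `tp_group` argument are assumptions for illustration; the linked Megatron-LM and NeMo code handle this in their own training loops.

```python
import torch
import torch.distributed as dist
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors


def allreduce_sequence_parallel_grads(model: torch.nn.Module,
                                      tp_group: dist.ProcessGroup) -> None:
    # Collect gradients of parameters flagged as sequence-parallel
    # (e.g. the LayerNorm weight/bias inside LayerNormLinear).
    # The `sequence_parallel` attribute is an assumption here.
    grads = [
        p.grad.data
        for p in model.parameters()
        if p.grad is not None and getattr(p, "sequence_parallel", False)
    ]
    if not grads:
        return

    # Coalesce into one contiguous buffer so only a single NCCL call is issued.
    coalesced = _flatten_dense_tensors(grads)
    dist.all_reduce(coalesced, group=tp_group)

    # Copy the reduced values back into the individual gradient tensors.
    for grad, reduced in zip(grads, _unflatten_dense_tensors(coalesced, grads)):
        grad.copy_(reduced)
```

This is roughly the pattern in the Megatron-LM optimizer code linked above: the coalescing is what lets multiple per-parameter all-reduces collapse into a single collective per step.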
Hi @timmoon10, is this still valid? Do we still need to handle the all-reduce outside of TE?