TransformerEngine
Need all-reduce for norm weight gradients with sequence parallel
https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/module/layernorm_linear.py#L461-L471
When we use sequence parallelism, do we need to all-reduce the norm weight gradients across TP groups after the code above?
Sorry for the late reply. Yes, that's correct. Currently we expect that this all-reduce happens outside of TE, which allows us to coalesce multiple all-reduces into a single NCCL call.
Megatron-LM: https://github.com/NVIDIA/Megatron-LM/blob/52f13005148afa47a6f37b082083fa2c6675ae3e/megatron/optimizer/optimizer.py#L244
NeMo: https://github.com/NVIDIA/NeMo/blob/a9fb58bcee7bff0e50a621d05ec3a9b5eb5f584c/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py#L696
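For illustration, here is a minimal sketch (not TE's or Megatron's exact code) of what such an external all-reduce can look like: gather every norm weight/bias gradient produced under sequence parallelism, flatten them into one buffer, and issue a single NCCL all-reduce over the tensor-parallel group. The `sequence_parallel` attribute and the `tp_group` argument are assumptions for illustration; the linked Megatron-LM and NeMo code handle this in their own training loops.

```python
import torch
import torch.distributed as dist
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors


def allreduce_sequence_parallel_grads(model: torch.nn.Module,
                                      tp_group: dist.ProcessGroup) -> None:
    # Collect gradients of parameters flagged as sequence-parallel
    # (e.g. the LayerNorm weight/bias inside LayerNormLinear).
    # The `sequence_parallel` attribute is an assumption here.
    grads = [
        p.grad.data
        for p in model.parameters()
        if p.grad is not None and getattr(p, "sequence_parallel", False)
    ]
    if not grads:
        return

    # Coalesce into one contiguous buffer so only a single NCCL call is issued.
    coalesced = _flatten_dense_tensors(grads)
    dist.all_reduce(coalesced, group=tp_group)

    # Copy the reduced values back into the individual gradient tensors.
    for grad, reduced in zip(grads, _unflatten_dense_tensors(coalesced, grads)):
        grad.copy_(reduced)
```

This is roughly the pattern in the Megatron-LM optimizer code linked above: the coalescing is what lets multiple per-parameter all-reduces collapse into a single collective per step.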
Hi @timmoon10, is this still valid? Do we still need to handle the all-reduce outside of TE?