[QUESTION] Why is sequence parallelism not supported by WrappedTorchLayerNorm (torch LayerNorm)?
@cuichenx Just like TENorm, could sequence parallelism be supported by adding the following to the WrappedTorchLayerNorm class?

```python
# Set flag for sequence parallelism (custom Megatron-LM integration)
if getattr(self, "sequence_parallel", None) is not None:
    self.weight.sequence_parallel = self.sequence_parallel
    self.bias.sequence_parallel = self.sequence_parallel
```
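For concreteness, a minimal sketch of what that change might look like on a plain torch LayerNorm. The class and argument names here are hypothetical, not the actual NeMo/Megatron wrapper:

```python
import torch

class WrappedTorchLayerNormSketch(torch.nn.LayerNorm):
    """Hypothetical sketch of the proposed change: a plain torch LayerNorm
    whose weight/bias carry the sequence_parallel attribute, mirroring the
    TENorm snippet above. Names are illustrative, not the real wrapper."""

    def __init__(self, hidden_size: int, eps: float = 1e-5, sequence_parallel: bool = False):
        super().__init__(hidden_size, eps=eps)
        self.sequence_parallel = sequence_parallel
        # Tag the affine parameters, as in the TENorm snippet, so that
        # Megatron-style gradient handling can identify them.
        self.weight.sequence_parallel = sequence_parallel
        self.bias.sequence_parallel = sequence_parallel
```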
No, changing the wrapper alone would not work. You would need the underlying implementation to support sequence parallelism.
@cuichenx Why would changing the wrapper alone not work? RMSNorm (Root Mean Square Normalization) does not operate across tokens; it normalizes each token independently, applying the normalization across the hidden (feature) dimension for each token separately. So RMSNorm is naturally compatible with sequence parallelism: each device can compute RMSNorm locally without any synchronization or collective communication. I am referencing the TENorm code.
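To illustrate the per-token argument with plain torch (no Megatron involved; the tensor shapes below are just an example): normalizing a sequence-sharded chunk gives the same values as normalizing the full sequence and then slicing, because the reduction happens only over the hidden dimension.

```python
import torch

torch.manual_seed(0)
seq_len, hidden = 8, 16
x = torch.randn(seq_len, hidden)

norm = torch.nn.LayerNorm(hidden)  # reduces over the hidden dim only

full = norm(x)                    # normalize the entire sequence
local = norm(x[: seq_len // 2])   # normalize only a local sequence shard

# Per-token normalization: the sharded result matches the full result.
assert torch.allclose(full[: seq_len // 2], local, atol=1e-6)
```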
@cuichenx Any suggestions?
Marking as stale. No activity in 60 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.