[QUESTION] Why is sequence parallelism not supported by WrappedTorchLayerNorm (torch LayerNorm)?
@cuichenx Just like TENorm, could sequence parallelism be supported by adding the following to the WrappedTorchLayerNorm class?

```python
# Set flag for sequence parallelism (custom Megatron-LM integration)
if getattr(self, "sequence_parallel", None) is not None:
    self.weight.sequence_parallel = self.sequence_parallel
    self.bias.sequence_parallel = self.sequence_parallel
```
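For concreteness, a minimal sketch of what that change might look like on a plain torch LayerNorm. The class and argument names here are hypothetical, not the actual NeMo/Megatron wrapper:

```python
import torch

class WrappedTorchLayerNormSketch(torch.nn.LayerNorm):
    """Hypothetical sketch of the proposed change: a plain torch LayerNorm
    whose weight/bias carry the sequence_parallel attribute, mirroring the
    TENorm snippet above. Names are illustrative, not the real wrapper."""

    def __init__(self, hidden_size: int, eps: float = 1e-5, sequence_parallel: bool = False):
        super().__init__(hidden_size, eps=eps)
        self.sequence_parallel = sequence_parallel
        # Tag the affine parameters, as in the TENorm snippet, so that
        # Megatron-style gradient handling can identify them.
        self.weight.sequence_parallel = sequence_parallel
        self.bias.sequence_parallel = sequence_parallel
```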
No, changing the wrapper alone would not work. You would need the underlying implementation to support sequence parallelism.
@cuichenx Why would changing the wrapper alone not work? RMSNorm (Root Mean Square Normalization) does not operate across tokens; it normalizes each token independently, applying the normalization across the hidden (feature) dimension for each token separately. So RMSNorm is naturally compatible with sequence parallelism: each device can compute RMSNorm locally without any synchronization or collective communication. I am referencing the TENorm code.
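To illustrate the per-token argument with plain torch (no Megatron involved; the tensor shapes below are just an example): normalizing a sequence-sharded chunk gives the same values as normalizing the full sequence and then slicing, because the reduction happens only over the hidden dimension.

```python
import torch

torch.manual_seed(0)
seq_len, hidden = 8, 16
x = torch.randn(seq_len, hidden)

norm = torch.nn.LayerNorm(hidden)  # reduces over the hidden dim only

full = norm(x)                    # normalize the entire sequence
local = norm(x[: seq_len // 2])   # normalize only a local sequence shard

# Per-token normalization: the sharded result matches the full result.
assert torch.allclose(full[: seq_len // 2], local, atol=1e-6)
```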
@cuichenx Any suggestions?
Marking as stale. No activity in 60 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.