TransformerEngine
Does TransformerEngine support FP8 communication such as all-gather or all-to-all?
In MoE model architectures, especially when the model size is quite large, we found that throughput is limited by communication (all-gather / reduce-scatter / all-to-all). All-gather and reduce-scatter are mainly used in ZeRO-3 or FSDP, while all-to-all is mainly used in expert parallelism. The communication volume is quite large and eventually becomes the bottleneck.
We found another FP8 library, torchao, which has FP8 all-gather communication enabled, but I cannot find a similar FP8 communication API in TE.
So, does TransformerEngine support FP8 communication such as all-gather/reduce-scatter or all-to-all?
I think FP8 all-gather should already be supported in TE (`_all_gather_fp8`).
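For reference, here is a minimal sketch of how an FP8 all-gather can be built on top of `torch.distributed` (this is not TE's internal `_all_gather_fp8` code, and the helper name `fp8_all_gather` is hypothetical). Since NCCL has no native FP8 dtype, the FP8 tensor is communicated as raw bytes; with delayed scaling, all ranks share the same scale factor, so the gathered bytes can be reinterpreted as FP8 directly:

```python
# Sketch only, assuming an initialized NCCL process group and PyTorch FP8 dtypes.
import torch
import torch.distributed as dist

def fp8_all_gather(x_fp8: torch.Tensor, group=None) -> torch.Tensor:
    """All-gather an FP8 (e.g. e4m3) tensor along dim 0 by communicating raw bytes."""
    world_size = dist.get_world_size(group)
    out = torch.empty(
        (world_size * x_fp8.shape[0], *x_fp8.shape[1:]),
        dtype=x_fp8.dtype, device=x_fp8.device,
    )
    # All-gather on the uint8 byte view; `out` aliases the same storage,
    # so its FP8 view holds the gathered result.
    dist.all_gather_into_tensor(out.view(torch.uint8), x_fp8.view(torch.uint8), group=group)
    return out
```

Compared with gathering in BF16, this halves the all-gather traffic, which is the main motivation for FP8 communication in the first place.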
It depends on the type of communication. For FP8 with delayed scaling:
- Tensor-parallel communication: all-gather in FP8 (see `_all_gather_fp8`), reduce-scatter in BF16 (see `reduce_scatter_along_first_dim`)
- PyTorch FSDP: param all-gather in FP8 (see `_fsdp_gather_tensors`), grad reduce-scatter in BF16
- Megatron-LM distributed optimizer: param all-gather in FP8, grad reduce-scatter in BF16 (see `DistributedOptimizer`)
- Megatron-LM MoE token dispatcher (see `MoEAlltoAllTokenDispatcher`): all-to-all in BF16 (see `_AllToAll`), sketched below
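For context, a minimal sketch of the BF16 all-to-all underlying token dispatch, built on `torch.distributed.all_to_all_single` (illustrative only; the helper name `dispatch_tokens_bf16` and the split bookkeeping are simplified assumptions, not the actual `MoEAlltoAllTokenDispatcher` code):

```python
# Sketch only, assuming an initialized NCCL process group and precomputed
# per-rank send/receive counts from the router.
import torch
import torch.distributed as dist

def dispatch_tokens_bf16(tokens: torch.Tensor,
                         input_splits: list[int],
                         output_splits: list[int],
                         group=None) -> torch.Tensor:
    """Send input_splits[r] rows to rank r and receive output_splits[r] rows from rank r."""
    tokens = tokens.to(torch.bfloat16)
    recv = torch.empty(
        (sum(output_splits), tokens.shape[1]),
        dtype=torch.bfloat16, device=tokens.device,
    )
    dist.all_to_all_single(
        recv, tokens,
        output_split_sizes=output_splits,
        input_split_sizes=input_splits,
        group=group,
    )
    return recv
```

Doing this exchange in FP8 instead of BF16 would halve the all-to-all volume, which is what the follow-up question below is about.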
Thank you! @timmoon10 @BestJuly
Just another question: does TE have plans to support FP8 all-to-all like what DeepEP has done?
TE will provide the necessary APIs, and the final integration of DeepEP with FP8 will be in Megatron. BTW, we have already integrated DeepEP with BF16 in Megatron.