TransformerEngine
Does TransformerEngine support FP8 communication such as all-gather or all-to-all?
In MoE model architectures, especially when the model size is quite large, we found that throughput is limited by communication (all-gather / reduce-scatter / all-to-all). All-gather and reduce-scatter are mainly used in ZeRO-3 or FSDP, while all-to-all is mainly used in expert parallelism. The communication volume is quite large and eventually becomes the bottleneck.
We found another FP8 library, torchao, which has FP8 all-gather communication enabled, but I cannot find a similar FP8 communication API in TE.
So, does TransformerEngine support FP8 communication such as all-gather/reduce-scatter or all-to-all?
I think FP8 all-gather should already be supported in TE (`_all_gather_fp8`).
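For reference, here is a minimal sketch of how an FP8 all-gather can be built on top of `torch.distributed` (this is not TE's internal `_all_gather_fp8` code, and the helper name `fp8_all_gather` is hypothetical). Since NCCL has no native FP8 dtype, the FP8 tensor is communicated as raw bytes; with delayed scaling, all ranks share the same scale factor, so the gathered bytes can be reinterpreted as FP8 directly:

```python
# Sketch only, assuming an initialized NCCL process group and PyTorch FP8 dtypes.
import torch
import torch.distributed as dist

def fp8_all_gather(x_fp8: torch.Tensor, group=None) -> torch.Tensor:
    """All-gather an FP8 (e.g. e4m3) tensor along dim 0 by communicating raw bytes."""
    world_size = dist.get_world_size(group)
    out = torch.empty(
        (world_size * x_fp8.shape[0], *x_fp8.shape[1:]),
        dtype=x_fp8.dtype, device=x_fp8.device,
    )
    # All-gather on the uint8 byte view; `out` aliases the same storage,
    # so its FP8 view holds the gathered result.
    dist.all_gather_into_tensor(out.view(torch.uint8), x_fp8.view(torch.uint8), group=group)
    return out
```

Compared with gathering in BF16, this halves the all-gather traffic, which is the main motivation for FP8 communication in the first place.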
It depends on the type of communication. For FP8 with delayed scaling:
- Tensor-parallel communication: all-gather in FP8 (see `_all_gather_fp8`), reduce-scatter in BF16 (see `reduce_scatter_along_first_dim`)
- PyTorch FSDP: param all-gather in FP8 (see `_fsdp_gather_tensors`), grad reduce-scatter in BF16
- Megatron-LM distributed optimizer: param all-gather in FP8, grad reduce-scatter in BF16 (see `DistributedOptimizer`)
- Megatron-LM MoE token dispatcher (see `MoEAlltoAllTokenDispatcher`): all-to-all in BF16 (see `_AllToAll`), sketched below
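For context, a minimal sketch of the BF16 all-to-all underlying token dispatch, built on `torch.distributed.all_to_all_single` (illustrative only; the helper name `dispatch_tokens_bf16` and the split bookkeeping are simplified assumptions, not the actual `MoEAlltoAllTokenDispatcher` code):

```python
# Sketch only, assuming an initialized NCCL process group and precomputed
# per-rank send/receive counts from the router.
import torch
import torch.distributed as dist

def dispatch_tokens_bf16(tokens: torch.Tensor,
                         input_splits: list[int],
                         output_splits: list[int],
                         group=None) -> torch.Tensor:
    """Send input_splits[r] rows to rank r and receive output_splits[r] rows from rank r."""
    tokens = tokens.to(torch.bfloat16)
    recv = torch.empty(
        (sum(output_splits), tokens.shape[1]),
        dtype=torch.bfloat16, device=tokens.device,
    )
    dist.all_to_all_single(
        recv, tokens,
        output_split_sizes=output_splits,
        input_split_sizes=input_splits,
        group=group,
    )
    return recv
```

Doing this exchange in FP8 instead of BF16 would halve the all-to-all volume, which is what the follow-up question below is about.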
Thank you! @timmoon10 @BestJuly
Just another question: does TE have plans to support FP8 all-to-all like what DeepEP has done?
TE will provide the necessary APIs, and the final integration of DeepEP with FP8 will be in Megatron. BTW, we have already integrated DeepEP with BF16 in Megatron.