Xin Yao
Can you summarize your questions into the following two?

1. Q: Why is there such a large difference in duration between `nvte_multi_stream_cublas_gemm` and `TERowParallelGroupedLinear`?
   A: It's the CPU overheads of...
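If you want to see that split yourself, here is a minimal profiling sketch (assuming the standalone `te.GroupedLinear` API with an `(input, m_splits)` forward; the sizes are placeholders) that separates the CPU-side time spent in the module from the GPU time of the underlying grouped cuBLAS GEMM kernels:

```python
import torch
from torch.profiler import profile, ProfilerActivity
import transformer_engine.pytorch as te

num_gemms, hidden = 8, 4096
m_splits = [1024] * num_gemms  # tokens routed to each expert (placeholder)
layer = te.GroupedLinear(num_gemms, hidden, hidden, bias=False).cuda()
inp = torch.randn(sum(m_splits), hidden, device="cuda", requires_grad=True)

# Warm up so one-time initialization does not show up in the profile.
for _ in range(10):
    layer(inp, m_splits).sum().backward()
torch.cuda.synchronize()

# Profile a single forward/backward step; compare CPU time vs CUDA time per op.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    layer(inp, m_splits).sum().backward()
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```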
Could it be a load-balancing issue in MoE training? Can you try drop-and-pad by setting

```python
moe_token_drop_policy="probs"
moe_expert_capacity_factor=1.0
moe_pad_expert_input_to_capacity=True
```

and see if the issue still happens?
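For reference, a sketch of where these knobs usually live when building the model through Megatron-Core (assuming `TransformerConfig` exposes these MoE fields; the other values are placeholders):

```python
from megatron.core.transformer.transformer_config import TransformerConfig

config = TransformerConfig(
    num_layers=2,            # placeholder model shape
    hidden_size=4096,
    num_attention_heads=32,
    num_moe_experts=8,
    moe_router_topk=2,
    # Drop-and-pad settings from the snippet above.
    moe_token_drop_policy="probs",
    moe_expert_capacity_factor=1.0,
    moe_pad_expert_input_to_capacity=True,
)
```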
I can reproduce the issue. I ran two GroupedLinear layers for 1000 iterations using the above snippet and observed 3 slowdowns, at iterations 357, 533, and 710. Then I checked...
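For anyone who wants to reproduce it, a sketch along these lines should do (assuming the `(input, m_splits)` forward signature of `te.GroupedLinear`; shapes are placeholders):

```python
import torch
import transformer_engine.pytorch as te

num_gemms, hidden, tokens_per_expert = 8, 4096, 1024
m_splits = [tokens_per_expert] * num_gemms

fc1 = te.GroupedLinear(num_gemms, hidden, 4 * hidden, bias=False).cuda()
fc2 = te.GroupedLinear(num_gemms, 4 * hidden, hidden, bias=False).cuda()
inp = torch.randn(sum(m_splits), hidden, device="cuda", requires_grad=True)

times = []
for it in range(1000):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    out = fc2(fc1(inp, m_splits), m_splits)
    out.sum().backward()
    end.record()
    torch.cuda.synchronize()
    times.append(start.elapsed_time(end))  # per-iteration time in ms

# Flag iterations that take more than 2x the median time.
median = sorted(times)[len(times) // 2]
print("slow iterations:", [i for i, t in enumerate(times) if t > 2 * median])
```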
> Just another question, does TE have plans to support FP8 all-to-all like what [DeepEP](https://github.com/deepseek-ai/DeepEP) has done?

TE will provide the necessary APIs, and the final integration of DeepEP with...
TE 2.0 now supports MXFP8 training on Blackwell.
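A minimal usage sketch, assuming the recipe class is exposed as `transformer_engine.common.recipe.MXFP8BlockScaling` (layer and batch sizes below are placeholders):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import MXFP8BlockScaling

recipe = MXFP8BlockScaling()                 # MXFP8 recipe for Blackwell
linear = te.Linear(4096, 4096, bias=False)   # TE module, placeholder sizes
inp = torch.randn(1024, 4096, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    out = linear(inp)
out.sum().backward()  # backward runs outside the autocast region as usual
```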
@timmoon10 I totally understand what #1083 was doing. My question is why we moved away from it in TE 2.0, for example, calculating `scale_inv` in `Float8Quantizer::create_tensor` instead of in the...
FA3 support was added in https://github.com/NVIDIA/TransformerEngine/pull/1019.
We will not add DeepGEMM into TE because it lacks the GEMM for wgrad (1x128 by 1x128) in backpropagation, and its JIT mechanism brings non-negligible overhead in training. We're...
> Hi Xin,
>
> Could we expect an ETA on this?

See PR https://github.com/NVIDIA/TransformerEngine/pull/1559. Note you will need CUDA 12.9 to run it because the groupwise/blockwise FP8 GEMM is...
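If it helps, a quick runtime guard for that requirement, using the CUDA version PyTorch was built against as a proxy for the toolkit you actually have:

```python
import torch

# Parse "12.x" from the CUDA version PyTorch reports.
major, minor = (int(x) for x in torch.version.cuda.split(".")[:2])
assert (major, minor) >= (12, 9), (
    f"Groupwise/blockwise FP8 GEMM needs CUDA >= 12.9, found {torch.version.cuda}"
)
```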