Xin Yao
Can you summarize your questions into the following two?

1. Q: Why is there such a large difference in duration between `nvte_multi_stream_cublas_gemm` and `TERowParallelGroupedLinear`?
   A: It's the CPU overheads of...
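If you want to see that split yourself, here is a minimal profiling sketch (assuming the standalone `te.GroupedLinear` API with an `(input, m_splits)` forward; the sizes are placeholders) that separates the CPU-side time spent in the module from the GPU time of the underlying grouped cuBLAS GEMM kernels:

```python
import torch
from torch.profiler import profile, ProfilerActivity
import transformer_engine.pytorch as te

num_gemms, hidden = 8, 4096
m_splits = [1024] * num_gemms  # tokens routed to each expert (placeholder)
layer = te.GroupedLinear(num_gemms, hidden, hidden, bias=False).cuda()
inp = torch.randn(sum(m_splits), hidden, device="cuda", requires_grad=True)

# Warm up so one-time initialization does not show up in the profile.
for _ in range(10):
    layer(inp, m_splits).sum().backward()
torch.cuda.synchronize()

# Profile a single forward/backward step; compare CPU time vs CUDA time per op.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    layer(inp, m_splits).sum().backward()
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```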
Could it be a load-balancing issue in MoE training? Can you try drop-and-pad by setting

```python
moe_token_drop_policy="probs"
moe_expert_capacity_factor=1.0
moe_pad_expert_input_to_capacity=True
```

and see if the issue still happens?
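For reference, a sketch of where these knobs usually live when building the model through Megatron-Core (assuming `TransformerConfig` exposes these MoE fields; the other values are placeholders):

```python
from megatron.core.transformer.transformer_config import TransformerConfig

config = TransformerConfig(
    num_layers=2,            # placeholder model shape
    hidden_size=4096,
    num_attention_heads=32,
    num_moe_experts=8,
    moe_router_topk=2,
    # Drop-and-pad settings from the snippet above.
    moe_token_drop_policy="probs",
    moe_expert_capacity_factor=1.0,
    moe_pad_expert_input_to_capacity=True,
)
```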
I can reproduce the issue. I ran two GroupedLinear layers for 1000 iterations using the above snippet and observed 3 slowdowns, at iterations 357, 533, and 710. Then I checked...
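For anyone who wants to reproduce it, a sketch along these lines should do (assuming the `(input, m_splits)` forward signature of `te.GroupedLinear`; shapes are placeholders):

```python
import torch
import transformer_engine.pytorch as te

num_gemms, hidden, tokens_per_expert = 8, 4096, 1024
m_splits = [tokens_per_expert] * num_gemms

fc1 = te.GroupedLinear(num_gemms, hidden, 4 * hidden, bias=False).cuda()
fc2 = te.GroupedLinear(num_gemms, 4 * hidden, hidden, bias=False).cuda()
inp = torch.randn(sum(m_splits), hidden, device="cuda", requires_grad=True)

times = []
for it in range(1000):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    out = fc2(fc1(inp, m_splits), m_splits)
    out.sum().backward()
    end.record()
    torch.cuda.synchronize()
    times.append(start.elapsed_time(end))  # per-iteration time in ms

# Flag iterations that take more than 2x the median time.
median = sorted(times)[len(times) // 2]
print("slow iterations:", [i for i, t in enumerate(times) if t > 2 * median])
```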
> Just another question, does TE have plans to support FP8 all-to-all like what [DeepEP](https://github.com/deepseek-ai/DeepEP) has done?

TE will provide the necessary APIs, and the final integration of DeepEP with...
TE 2.0 now supports MXFP8 training on Blackwell.
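A minimal usage sketch, assuming the recipe class is exposed as `transformer_engine.common.recipe.MXFP8BlockScaling` (layer and batch sizes below are placeholders):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import MXFP8BlockScaling

recipe = MXFP8BlockScaling()                 # MXFP8 recipe for Blackwell
linear = te.Linear(4096, 4096, bias=False)   # TE module, placeholder sizes
inp = torch.randn(1024, 4096, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    out = linear(inp)
out.sum().backward()  # backward runs outside the autocast region as usual
```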
@timmoon10 I totally understand what #1083 was doing. My question is why we moved away from it in TE 2.0, for example, calculating `scale_inv` in `Float8Quantizer::create_tensor` instead of in the...
FA3 support was added in https://github.com/NVIDIA/TransformerEngine/pull/1019.
We will not add DeepGEMM into TE because it lacks the GEMM for wgrad (1x128 by 1x128) in backpropagation, and its JIT mechanism brings non-negligible overhead in training. We're...
> Hi Xin,
>
> Could we expect an ETA on this?

See PR https://github.com/NVIDIA/TransformerEngine/pull/1559. Note you will need CUDA 12.9 to run it because the groupwise/blockwise FP8 GEMM is...
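If it helps, a quick runtime guard for that requirement, using the CUDA version PyTorch was built against as a proxy for the toolkit you actually have:

```python
import torch

# Parse "12.x" from the CUDA version PyTorch reports.
major, minor = (int(x) for x in torch.version.cuda.split(".")[:2])
assert (major, minor) >= (12, 9), (
    f"Groupwise/blockwise FP8 GEMM needs CUDA >= 12.9, found {torch.version.cuda}"
)
```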