[Feature request] Integrate DeepGEMM
This might accelerate our MoE computation: https://github.com/deepseek-ai/DeepGEMM
Original request from: @ericxsun
Well, the GEMMs are very performant, but they are inference-only. They didn't release the backward portion (i.e., wgrad). From their issues discussion it seems they are considering releasing it, but in the interim we can't use them for training yet.
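For anyone wondering what exactly is missing: training needs two extra GEMMs in the backward pass. A quick plain-PyTorch sketch of the semantics (function names here are just for illustration, not DeepGEMM's API):

```python
import torch

# Forward: Y = X @ W, with X: (M, K) activations and W: (K, N) weights.
# Training additionally needs:
#   dgrad: dX = dY @ W.T   (gradient w.r.t. activations)
#   wgrad: dW = X.T @ dY   (gradient w.r.t. weights) <- the part not released
def gemm_fwd(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    return x @ w

def gemm_bwd(x: torch.Tensor, w: torch.Tensor, dy: torch.Tensor):
    dx = dy @ w.t()   # dgrad
    dw = x.t() @ dy   # wgrad
    return dx, dw
```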
Working on a Triton implementation to support both inference and training. The bf16 forward version is in testing now.
mark👀
Progress update: we have landed a forward MG * NG grouped GEMM for DeepSeek inference this week (bf16); you can run it using generate.py.
This also has backward kernels, but they need some touch-ups to match the NG portion (originally the grouped GEMM was MG * N).
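Roughly, this is the semantics the MG * NG grouped GEMM implements (a plain-PyTorch reference just to show the shapes; the Triton kernel does this in one fused launch, and the sizes below are illustrative):

```python
import torch

def grouped_gemm_ref(xs, ws):
    """Reference semantics for an MG * NG grouped GEMM.

    xs[g]: (M_g, K) tokens routed to group/expert g
    ws[g]: (K, N_g) that group's weight; both M_g and N_g may vary per group
    Returns ys[g]: (M_g, N_g) per-group outputs.
    """
    return [x @ w for x, w in zip(xs, ws)]

# Toy usage with varying M_g and N_g per group:
xs = [torch.randn(m, 64, dtype=torch.bfloat16) for m in (5, 17, 3)]
ws = [torch.randn(64, n, dtype=torch.bfloat16) for n in (128, 256, 128)]
ys = grouped_gemm_ref(xs, ws)
```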
I also have a Triton equivalent of DeepSeek's contiguous grouped GEMM in another PR, with forward and backward. This is bf16, but I will add fp8 so we can benchmark against DeepSeek's DeepGEMM.
For reference, DeepGEMM has two versions, contiguous and masked, where masked is for decoding. Ultimately we will compare all these versions and go with the most performant and flexible option. There are also multiple additional kernels in progress to accelerate the entire MoE layer (token sorting and permutation). More updates over the next two weeks!
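For those following along, the token sorting/permutation piece is essentially: sort tokens by their routed expert so each expert's tokens form a contiguous block, run the contiguous grouped GEMM over those blocks, then scatter the results back. A rough sketch of that bookkeeping (top-1 routing for simplicity; this shows the idea, not the actual kernels):

```python
import torch

def permute_for_contiguous_groups(tokens, expert_ids, num_experts):
    """Sort tokens by expert id so each expert sees one contiguous block.

    tokens:     (T, K) activations
    expert_ids: (T,)   routed expert per token (top-1 for simplicity)
    Returns the permuted tokens, per-expert group sizes (the M_g's for the
    contiguous grouped GEMM), and the permutation for scattering back.
    """
    order = torch.argsort(expert_ids)
    permuted = tokens[order]            # contiguous per-expert blocks
    group_sizes = torch.bincount(expert_ids, minlength=num_experts)
    return permuted, group_sizes, order

def unpermute(y_permuted, order):
    # Scatter grouped results back to the original token order.
    out = torch.empty_like(y_permuted)
    out[order] = y_permuted
    return out
```

As I understand it, the masked variant sidesteps this permutation during decoding by giving each expert a fixed-size slot and masking the unused rows instead.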
Amazing work @lessw2020!
> From their issues discussion it seems they are considering releasing it, but in the interim we can't use them for training yet.
It seems the training kernels were just merged in https://github.com/deepseek-ai/DeepGEMM/pull/95.
Thanks @vwxyzjn for the update! We have a cleaner version of DeepSeek now, so we can potentially integrate there, or just jump to mxfp8 directly.
Hi,
Is there any update on this?
Hi @ajWithNucleus, I'm no longer working on Titan, but maybe @tianyu-l or @danielvegamyhre can provide an update on any plans to integrate these training kernels.