
[Feature request] Integrate DeepGEMM

Open · lessw2020 opened this issue 10 months ago · 8 comments

This might accelerate our MoE computation: https://github.com/deepseek-ai/DeepGEMM
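For background, the MoE expert computation is essentially a loop of small per-expert matmuls, which is exactly what a grouped GEMM kernel like DeepGEMM fuses into a single launch. A rough PyTorch sketch of the per-expert reference (names are illustrative, not torchtitan's actual modules):

```python
import torch

def moe_experts_reference(x_sorted, expert_weights, tokens_per_expert):
    # x_sorted: [total_tokens, hidden], tokens already grouped by expert
    # expert_weights: [num_experts, hidden, ffn_dim]
    # tokens_per_expert: [num_experts] counts summing to total_tokens
    outputs = []
    start = 0
    for e, count in enumerate(tokens_per_expert.tolist()):
        # One small GEMM per expert -- this loop is what a grouped GEMM
        # kernel (e.g. DeepGEMM) would replace with a single fused launch.
        outputs.append(x_sorted[start:start + count] @ expert_weights[e])
        start += count
    return torch.cat(outputs, dim=0)
```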

Original request from: @ericxsun

lessw2020 · Feb 26 '25

Well, the GEMMs are very performant, but they are inference-only. They didn't release the backward portion, i.e. the wgrad kernels. From their issues discussion it seems they are considering releasing it, but in the interim we can't use them for training yet.
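To spell out what's missing: for Y = X @ W the backward pass needs two more GEMMs, dX = dY @ W^T (dgrad) and dW = X^T @ dY (wgrad), and only the forward kernel has been released. A minimal sketch of that split:

```python
import torch

class GemmFn(torch.autograd.Function):
    """Minimal sketch of why training needs more than the forward kernel."""

    @staticmethod
    def forward(ctx, x, w):
        ctx.save_for_backward(x, w)
        return x @ w                   # forward GEMM (what is released today)

    @staticmethod
    def backward(ctx, grad_out):
        x, w = ctx.saved_tensors
        grad_x = grad_out @ w.t()      # dgrad
        grad_w = x.t() @ grad_out      # wgrad -- the missing piece for training
        return grad_x, grad_w
```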

lessw2020 · Feb 28 '25

Working on a Triton implementation to support both inference and training. The bf16 forward version is in testing now.
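For the testing side, a typical way to validate a bf16 forward kernel is to compare it against a float32 reference with loose, bf16-appropriate tolerances. A sketch of such a check, where `group_gemm_fwd` is a placeholder for the kernel under test (not its real name) and takes lists of per-group tensors:

```python
import torch

def check_bf16_forward(group_gemm_fwd, m_sizes, n, k, device="cuda"):
    # group_gemm_fwd is a placeholder for the Triton grouped GEMM under test.
    xs = [torch.randn(m, k, device=device, dtype=torch.bfloat16) for m in m_sizes]
    ws = [torch.randn(k, n, device=device, dtype=torch.bfloat16) for _ in m_sizes]
    out = group_gemm_fwd(xs, ws)
    ref = [x.float() @ w.float() for x, w in zip(xs, ws)]
    for o, r in zip(out, ref):
        # bf16 rounding error grows with k, so tolerances are kept loose.
        torch.testing.assert_close(o.float(), r, atol=1e-1, rtol=1e-2)
```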

lessw2020 · Mar 02 '25

mark👀

rbao2018 · Apr 05 '25

Progress update: we landed a forward MG * NG grouped GEMM for DeepSeek inference this week (bf16); you can run it using generate.py. This also has backward kernels, but they need some touch-ups to match the NG portion (originally the grouped GEMM was MG * N).

I also have a Triton equivalent of DeepSeek's contiguous grouped GEMM in another PR, with forward and backward. It is bf16 for now, but I will add fp8 so we can benchmark against DeepSeek's DeepGEMM.

For reference, DeepGEMM has two versions, contiguous and masked, where masked is for decoding. Ultimately we will compare all of these versions and go with the most performant and flexible options. There are also multiple additional kernels in progress to accelerate the entire MoE layer (sorting and permutation for tokens). More updates over the next two weeks!
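To make the "contiguous" layout concrete: tokens are permuted so that each expert's rows sit in one contiguous slice, and the resulting group sizes/offsets are what the grouped GEMM consumes. A rough PyTorch sketch of that sorting/permutation step (function and variable names are illustrative, not the actual torchtitan kernels):

```python
import torch

def build_contiguous_groups(x, expert_ids, num_experts):
    # x: [num_tokens, hidden]; expert_ids: [num_tokens] from the router.
    # Sort tokens by expert so each expert's tokens are contiguous in memory,
    # which is the layout the "contiguous" grouped GEMM variant expects.
    perm = torch.argsort(expert_ids, stable=True)
    x_sorted = x[perm]
    tokens_per_expert = torch.bincount(expert_ids, minlength=num_experts)
    # Group offsets delimit each expert's slice of x_sorted.
    offsets = torch.cumsum(tokens_per_expert, dim=0)
    return x_sorted, tokens_per_expert, offsets, perm
```

Roughly, the masked variant instead gives each expert a fixed-size slot and uses a mask to mark valid rows, which is why it suits decoding, where shapes need to stay static from step to step.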

lessw2020 · Apr 05 '25

Amazing work @lessw2020!

> From their issues discussion it seems they are considering releasing but in the interim we can't use them for training yet.

It seems https://github.com/deepseek-ai/DeepGEMM/pull/95 just merged the training kernels

vwxyzjn · Jul 21 '25

Thanks @vwxyzjn for the update! We have a cleaner version of DeepSeek now, so we can potentially integrate there, or just jump to mxfp8 directly.
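For reference, mxfp8 here means block-scaled fp8: one shared power-of-two scale per small block of elements (32 in the OCP MX spec), with the data stored as e4m3. A rough emulation in plain PyTorch, just to illustrate the format; this is not the torchao/torchtitan implementation, and it assumes the tensor's element count is divisible by the block size:

```python
import torch

def quantize_mxfp8_emulated(x, block_size=32):
    # Emulated mxfp8-style quantization: one shared power-of-two scale per
    # block of `block_size` elements, data cast to float8_e4m3fn.
    orig_shape = x.shape
    xb = x.reshape(-1, block_size).float()
    amax = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    # Scale rounded up to a power of two so the block maximum fits in e4m3 range.
    scale = torch.exp2(torch.ceil(torch.log2(amax / fp8_max)))
    q = (xb / scale).to(torch.float8_e4m3fn)
    return q.reshape(orig_shape), scale
```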

lessw2020 · Jul 21 '25

Hi,

Is there any update on this?

ajWithNucleus · Nov 09 '25

Hi @ajWithNucleus, I'm no longer working on Titan, but maybe @tianyu-l or @danielvegamyhre can provide an update on any plans to integrate these training kernels.

lessw2020 · Nov 09 '25