[Feature request] Integrate DeepGEMM
This might accelerate our MoE computation: https://github.com/deepseek-ai/DeepGEMM
Original request from: @ericxsun
Well, the GEMMs are very performant, but they are inference-only. They didn't release the backward portion (i.e., wgrad). From their issues discussion it seems they are considering releasing it, but in the interim we can't use them for training yet.
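For anyone wondering what exactly is missing: training needs two extra GEMMs in the backward pass. A quick plain-PyTorch sketch of the semantics (function names here are just for illustration, not DeepGEMM's API):

```python
import torch

# Forward: Y = X @ W, with X: (M, K) activations and W: (K, N) weights.
# Training additionally needs:
#   dgrad: dX = dY @ W.T   (gradient w.r.t. activations)
#   wgrad: dW = X.T @ dY   (gradient w.r.t. weights) <- the part not released
def gemm_fwd(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    return x @ w

def gemm_bwd(x: torch.Tensor, w: torch.Tensor, dy: torch.Tensor):
    dx = dy @ w.t()   # dgrad
    dw = x.t() @ dy   # wgrad
    return dx, dw
```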
Working on a Triton implementation to support both inference and training. The bf16 forward version is in testing now.
mark👀
Progress update: we have landed a forward MG * NG grouped GEMM for DeepSeek inference this week (bf16); you can run it using generate.py.
This also has backward kernels, but they need some touch-ups to match the NG portion (originally the grouped GEMM was MG * N).
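Roughly, this is the semantics the MG * NG grouped GEMM implements (a plain-PyTorch reference just to show the shapes; the Triton kernel does this in one fused launch, and the sizes below are illustrative):

```python
import torch

def grouped_gemm_ref(xs, ws):
    """Reference semantics for an MG * NG grouped GEMM.

    xs[g]: (M_g, K) tokens routed to group/expert g
    ws[g]: (K, N_g) that group's weight; both M_g and N_g may vary per group
    Returns ys[g]: (M_g, N_g) per-group outputs.
    """
    return [x @ w for x, w in zip(xs, ws)]

# Toy usage with varying M_g and N_g per group:
xs = [torch.randn(m, 64, dtype=torch.bfloat16) for m in (5, 17, 3)]
ws = [torch.randn(64, n, dtype=torch.bfloat16) for n in (128, 256, 128)]
ys = grouped_gemm_ref(xs, ws)
```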
I also have a Triton equivalent of DeepSeek's contiguous grouped GEMM in another PR, with forward and backward. This is bf16, but I will add fp8 so we can benchmark against DeepSeek's DeepGEMM.
For reference, DeepGEMM has two versions, contiguous and masked, where masked is for decoding. Ultimately we will compare all these versions and go with the most performant and flexible option. There are also multiple additional kernels in progress to accelerate the entire MoE layer (token sorting and permutation). More updates over the next two weeks!
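For those following along, the token sorting/permutation piece is essentially: sort tokens by their routed expert so each expert's tokens form a contiguous block, run the contiguous grouped GEMM over those blocks, then scatter the results back. A rough sketch of that bookkeeping (top-1 routing for simplicity; this shows the idea, not the actual kernels):

```python
import torch

def permute_for_contiguous_groups(tokens, expert_ids, num_experts):
    """Sort tokens by expert id so each expert sees one contiguous block.

    tokens:     (T, K) activations
    expert_ids: (T,)   routed expert per token (top-1 for simplicity)
    Returns the permuted tokens, per-expert group sizes (the M_g's for the
    contiguous grouped GEMM), and the permutation for scattering back.
    """
    order = torch.argsort(expert_ids)
    permuted = tokens[order]            # contiguous per-expert blocks
    group_sizes = torch.bincount(expert_ids, minlength=num_experts)
    return permuted, group_sizes, order

def unpermute(y_permuted, order):
    # Scatter grouped results back to the original token order.
    out = torch.empty_like(y_permuted)
    out[order] = y_permuted
    return out
```

As I understand it, the masked variant sidesteps this permutation during decoding by giving each expert a fixed-size slot and masking the unused rows instead.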
Amazing work @lessw2020!
> From their issues discussion it seems they are considering releasing it, but in the interim we can't use them for training yet.
It seems the training kernels were just merged in https://github.com/deepseek-ai/DeepGEMM/pull/95.
Thanks @vwxyzjn for the update! We have a cleaner version of DeepSeek now, so we can potentially integrate there, or just jump to mxfp8 directly.
Hi,
Is there any update on this?
Hi @ajWithNucleus, I'm no longer working on Titan, but maybe @tianyu-l or @danielvegamyhre can provide an update on any plans to integrate these training kernels.