New DeepGEMM-Style Groupwise Kernel
Summary: Initial enablement of CUTLASS' new groupwise scaling API for FP8 GEMM. This diff adds all the needed scaffolding and confirms that the kernel runs and produces correct outputs, but it does not yet include the tuning that would yield better performance. Interestingly, CUTLASS wants group/block scales in MN-major format, while every other groupwise implementation I've seen uses K-major. I add an option to our Triton blockwise quantization kernels to support this layout.
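For illustration, here is a minimal plain-PyTorch sketch of the two scale layouts (the 1x128 group size, shapes, and variable names are assumptions for the example, not the FBGEMM API): the scale values are identical in both layouts, only the strides differ, with MN-major making the M dimension contiguous as the CUTLASS groupwise-scaling epilogue expects.

```python
import torch

# Minimal sketch (not the FBGEMM API): per-group scales for an M x K activation
# quantized with 1x128 groups along K form an [M, K // 128] tensor. "K major"
# keeps the group index contiguous; "MN major" (what CUTLASS expects) keeps the
# M dimension contiguous -- same values, transposed strides.
M, K, GROUP_K = 256, 1024, 128
x = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")

# Per-group absmax scales, computed in plain PyTorch for clarity.
groups = x.float().view(M, K // GROUP_K, GROUP_K)
scales_k_major = groups.abs().amax(dim=-1) / torch.finfo(torch.float8_e4m3fn).max
# scales_k_major: shape [M, K // 128], stride (K // 128, 1) -> group index is fastest.

# MN-major variant: identical values, but M is the contiguous dimension.
scales_mn_major = scales_k_major.t().contiguous().t()
# scales_mn_major: shape [M, K // 128], stride (1, M) -> M is fastest.
```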
In benchmarking the performance of those quantization kernels, I see that Triton blockwise quantization in general (with or without K-major output) is quite slow. We may need to iterate on it if this becomes a commonly used kernel; a rough benchmarking sketch is shown below.
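The following is a minimal timing harness of the kind used for such a measurement, built on `triton.testing.do_bench`. The naive PyTorch op is a placeholder so the snippet runs as-is; the actual Triton blockwise kernel (whose exact FBGEMM name and signature are not assumed here) would be swapped in for a real comparison.

```python
import torch
import triton.testing

# Rough timing sketch; the naive PyTorch op below stands in for the Triton
# blockwise quantization kernel so the harness is self-contained and runnable.
def naive_blockwise_quantize(x: torch.Tensor, group_k: int = 128):
    """1 x group_k groupwise FP8 quantization in plain PyTorch (placeholder)."""
    m, k = x.shape
    groups = x.float().view(m, k // group_k, group_k)
    scales = groups.abs().amax(dim=-1).clamp(min=1e-12)
    scales = scales / torch.finfo(torch.float8_e4m3fn).max
    xq = (groups / scales.unsqueeze(-1)).view(m, k).to(torch.float8_e4m3fn)
    return xq, scales

x = torch.randn(8192, 8192, dtype=torch.bfloat16, device="cuda")
# Swap in the Triton blockwise kernel (K-major or MN-major output) to reproduce
# the comparison described above.
ms = triton.testing.do_bench(lambda: naive_blockwise_quantize(x))
print(f"naive blockwise quantize: {ms:.3f} ms")
```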
Differential Revision: D76830629
Deploy Preview for pytorch-fbgemm-docs ready!
| Name | Link |
|---|---|
| Latest commit | 47c135d23052c82fdbe7c06c1533f98925a1586f |
| Latest deploy log | https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/686ff413a14fd1000848503d |
| Deploy Preview | https://deploy-preview-4365--pytorch-fbgemm-docs.netlify.app |
This pull request was exported from Phabricator. Differential Revision: D76830629
This pull request has been merged in pytorch/FBGEMM@6bdbc78f361acdcd5467cfdb78fdb1b8588552b8.