New DeepGEMM-Style Groupwise Kernel
Summary: Initial enablement of CUTLASS' new groupwise scaling API for FP8 GEMM. This diff adds all the needed scaffolding and confirms that the kernel runs and produces correct outputs, but it does not yet include the tuning that would yield better performance. Interestingly, CUTLASS wants group/block scales in MN-major format, while every other groupwise implementation I've seen uses K-major. I add an option to our Triton blockwise quantization kernels to support this layout.
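For illustration, here is a minimal plain-PyTorch sketch of the two scale layouts (the 1x128 group size, shapes, and variable names are assumptions for the example, not the FBGEMM API): the scale values are identical in both layouts, only the strides differ, with MN-major making the M dimension contiguous as the CUTLASS groupwise-scaling epilogue expects.

```python
import torch

# Minimal sketch (not the FBGEMM API): per-group scales for an M x K activation
# quantized with 1x128 groups along K form an [M, K // 128] tensor. "K major"
# keeps the group index contiguous; "MN major" (what CUTLASS expects) keeps the
# M dimension contiguous -- same values, transposed strides.
M, K, GROUP_K = 256, 1024, 128
x = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")

# Per-group absmax scales, computed in plain PyTorch for clarity.
groups = x.float().view(M, K // GROUP_K, GROUP_K)
scales_k_major = groups.abs().amax(dim=-1) / torch.finfo(torch.float8_e4m3fn).max
# scales_k_major: shape [M, K // 128], stride (K // 128, 1) -> group index is fastest.

# MN-major variant: identical values, but M is the contiguous dimension.
scales_mn_major = scales_k_major.t().contiguous().t()
# scales_mn_major: shape [M, K // 128], stride (1, M) -> M is fastest.
```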
In benchmarking the performance of those quantization kernels, I see that Triton blockwise quantization in general (with or without K-major output) is quite slow. We may need to iterate on it if this becomes a commonly used kernel; a rough benchmarking sketch is shown below.
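The following is a minimal timing harness of the kind used for such a measurement, built on `triton.testing.do_bench`. The naive PyTorch op is a placeholder so the snippet runs as-is; the actual Triton blockwise kernel (whose exact FBGEMM name and signature are not assumed here) would be swapped in for a real comparison.

```python
import torch
import triton.testing

# Rough timing sketch; the naive PyTorch op below stands in for the Triton
# blockwise quantization kernel so the harness is self-contained and runnable.
def naive_blockwise_quantize(x: torch.Tensor, group_k: int = 128):
    """1 x group_k groupwise FP8 quantization in plain PyTorch (placeholder)."""
    m, k = x.shape
    groups = x.float().view(m, k // group_k, group_k)
    scales = groups.abs().amax(dim=-1).clamp(min=1e-12)
    scales = scales / torch.finfo(torch.float8_e4m3fn).max
    xq = (groups / scales.unsqueeze(-1)).view(m, k).to(torch.float8_e4m3fn)
    return xq, scales

x = torch.randn(8192, 8192, dtype=torch.bfloat16, device="cuda")
# Swap in the Triton blockwise kernel (K-major or MN-major output) to reproduce
# the comparison described above.
ms = triton.testing.do_bench(lambda: naive_blockwise_quantize(x))
print(f"naive blockwise quantize: {ms:.3f} ms")
```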
Differential Revision: D76830629
Deploy Preview for pytorch-fbgemm-docs ready!
| Name | Link |
|---|---|
| Latest commit | 47c135d23052c82fdbe7c06c1533f98925a1586f |
| Latest deploy log | https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/686ff413a14fd1000848503d |
| Deploy Preview | https://deploy-preview-4365--pytorch-fbgemm-docs.netlify.app |
This pull request was exported from Phabricator. Differential Revision: D76830629
This pull request has been merged in pytorch/FBGEMM@6bdbc78f361acdcd5467cfdb78fdb1b8588552b8.