FBGEMM
Gate invalid Triton autotune configs in AOTInductor for GFX95+
Summary: Saw a lowering error when compiling models on MI350X with FP8 PyTorch: P1966277532
The issue arises from the lack of instruction support for BLOCK_K <= 64 when matrix_instr_nonkdim=16 on GFX95+ hardware. This was previously patched for FP8 Triton in D81180838, but the error now shows up in AOTI code paths with FP8 PyTorch.
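For illustration, a minimal Python sketch of this kind of gating: pruning autotune configs that combine matrix_instr_nonkdim=16 with BLOCK_K <= 64 before autotuning on GFX95+ devices. The helper names (`is_gfx95_plus`, `prune_invalid_configs`) and the `gcnArchName`-based architecture check are assumptions for the sketch, not the actual implementation in this diff.

```python
# Hypothetical sketch, not the FBGEMM code from this diff: gate Triton
# autotune configs that GFX95+ matrix cores cannot execute.
import torch
import triton


def is_gfx95_plus() -> bool:
    # Assumption: detect GFX95+ via the ROCm arch name reported by torch.
    # The real check in FBGEMM may differ.
    if not torch.version.hip:
        return False
    arch = torch.cuda.get_device_properties(0).gcnArchName
    return arch.startswith("gfx95")


def prune_invalid_configs(configs: list[triton.Config]) -> list[triton.Config]:
    """Drop configs that are invalid on GFX95+ hardware."""
    if not is_gfx95_plus():
        return configs
    pruned = []
    for cfg in configs:
        # Assumption: BLOCK_K and matrix_instr_nonkdim are meta-parameters
        # in the config's kwargs dict, as is common in AMD Triton kernels.
        block_k = cfg.kwargs.get("BLOCK_K", 0)
        nonkdim = cfg.kwargs.get("matrix_instr_nonkdim", 0)
        # Per the summary: matrix_instr_nonkdim=16 with BLOCK_K <= 64 lacks
        # instruction support on GFX95+, so such configs are gated out.
        if nonkdim == 16 and block_k <= 64:
            continue
        pruned.append(cfg)
    return pruned
```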
Differential Revision: D83383625
Deploy Preview for pytorch-fbgemm-docs ready!
| Name | Link |
|---|---|
| Latest commit | 66d3d30b9b65aacd0cad80d894f087da6c32daa9 |
| Latest deploy log | https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/68d7264a1236520008f4811f |
| Deploy Preview | https://deploy-preview-4940--pytorch-fbgemm-docs.netlify.app |
@JChunX has exported this pull request. If you are a Meta employee, you can view the originating diff in D83383625.