ThunderKittens
Performance Parity for H100_mma_ABt and H100_mma Kernels
Summary
This 4-line code change achieves performance parity between the transposed (`H100_mma_ABt`) and non-transposed (`H100_mma`) matmul kernels by dispatching the largest available tensor core instruction (a `wgmma` of size 64x16x256). Previously, the transposed kernel was approximately 65 TFLOPS slower than its non-transposed counterpart.
Changes
- Changed the `wgmma` instruction size from `64x16x64` to `64x16x256`.
- Fixed strides for the column-major `B` tensor. This ensures correctness for cases where `N != K`, resolving previous correctness check failures.
Benchmark Changes
- Updated benchmark dimensions from square (`N=4096`) to rectangular (`M=2048, N=4096, K=8192`) to showcase and validate performance improvements and correctness for non-square inputs.
Testing
Verified correctness and performance improvements through internal benchmarks. Confirmed stable results and performance parity with `H100_mma`.