ThunderKittens
Performance Parity for H100_mma_ABt and H100_mma Kernels
Summary
This 4-line code change achieves performance parity between the transposed (`H100_mma_ABt`) and non-transposed (`H100_mma`) matmul kernels by dispatching the largest available tensor core instruction (a `wgmma` of size 64x16x256). Previously, the transposed kernel was approximately 65 TFLOPS slower than its non-transposed counterpart.
Changes
- Changed the `wgmma` instruction size from `64x16x64` to `64x16x256`.
- Fixed strides for the column-major `B` tensor. This ensures correctness for cases where `N != K`, resolving previous correctness check failures.
Benchmark Changes
- Updated benchmark dimensions from square (`N=4096`) to rectangular (`M=2048, N=4096, K=8192`) to showcase and validate performance improvements and correctness for non-square inputs.
Testing
Verified correctness and performance improvements through internal benchmarks. Confirmed stable results and performance parity with `H100_mma`.