
Add high-performance sparse BSR GEMM for Ada (sm_89) - 1.74× vs baseline

Open · bGOATnote opened this pull request 2 months ago · 2 comments

Performance:

  • 52.1 TFLOPS on NVIDIA L4 (Ada, SM 8.9)
  • 1.74× faster than CUTLASS 4.3.0 baseline (~30 TFLOPS)
  • 63× faster than cuSPARSE (0.87 TFLOPS)
  • 83% of dense cuBLAS throughput (62.5 TFLOPS)
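
(For reference, the efficiency and speedup figures follow directly from the throughput numbers above: 52.1 / 62.5 ≈ 0.83 and 52.1 / 30 ≈ 1.74.)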

Technical approach (a simplified sketch of the core loop follows this list):

  • WMMA tensor cores (16×16×16 FP16)
  • 2-stage pipeline with cp.async
  • Optimized tile sizes (BM=256, BN=128, BK=32)
  • Zero branch divergence (100% branch efficiency in Nsight Compute)
  • Achieved occupancy at 99.22% of theoretical
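
For context, these pieces combine as follows: each warp accumulates one 16×16 C tile by walking the nonzero 16×16 blocks of a BSR block-row, double-buffering the A blocks through shared memory with cp.async while the tensor cores consume the previous block. The sketch below is a minimal illustration of that pattern, not the submitted kernel (which tiles at BM=256/BN=128/BK=32 and also stages B); names like `bsr_row_ptr`, `bsr_col_idx`, and `bsr_vals` are hypothetical, and B/C are assumed row-major FP16/FP32 with n a multiple of 16.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
#include <cuda_pipeline.h>
using namespace nvcuda;

constexpr int BS = 16;  // BSR block size == WMMA tile size (16x16x16 FP16)

// One warp per thread block; launch as <<<dim3(n / BS, num_block_rows), 32>>>.
__global__ void bsr_gemm_sketch(const int* bsr_row_ptr,  // [num_block_rows + 1]
                                const int* bsr_col_idx,  // block-column per nonzero block
                                const half* bsr_vals,    // BS*BS halves per nonzero block
                                const half* B, float* C, int n) {
  // Two shared-memory stages for the A blocks: the 2-stage cp.async pipeline.
  __align__(16) __shared__ half a_smem[2][BS * BS];

  const int block_row = blockIdx.y;       // 16-row stripe of A and C
  const int tile_col  = blockIdx.x * BS;  // 16-column stripe of B and C
  const int begin = bsr_row_ptr[block_row];
  const int end   = bsr_row_ptr[block_row + 1];

  wmma::fragment<wmma::matrix_a, BS, BS, BS, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, BS, BS, BS, half, wmma::row_major> b_frag;
  wmma::fragment<wmma::accumulator, BS, BS, BS, float> c_frag;
  wmma::fill_fragment(c_frag, 0.0f);

  // Prefetch the first nonzero A block: 256 halves = 512 bytes, i.e. one
  // 16-byte cp.async per lane of the 32-thread warp.
  if (begin < end) {
    __pipeline_memcpy_async(&a_smem[0][threadIdx.x * 8],
                            &bsr_vals[(size_t)begin * BS * BS + threadIdx.x * 8], 16);
    __pipeline_commit();
  }

  for (int blk = begin; blk < end; ++blk) {
    const int stage = (blk - begin) & 1;
    // Prefetch the next block into the other stage while this one is consumed.
    if (blk + 1 < end) {
      __pipeline_memcpy_async(&a_smem[stage ^ 1][threadIdx.x * 8],
                              &bsr_vals[(size_t)(blk + 1) * BS * BS + threadIdx.x * 8], 16);
      __pipeline_commit();
    }
    __pipeline_wait_prior(blk + 1 < end ? 1 : 0);  // current stage has landed
    __syncwarp();

    wmma::load_matrix_sync(a_frag, &a_smem[stage][0], BS);
    // The matching 16x16 B tile is read straight from global memory in this
    // sketch; the real kernel stages B through shared memory as well.
    wmma::load_matrix_sync(b_frag,
                           B + (size_t)bsr_col_idx[blk] * BS * n + tile_col, n);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
  }

  wmma::store_matrix_sync(C + (size_t)block_row * BS * n + tile_col, c_frag,
                          n, wmma::mem_row_major);
}
```

Because every thread in a warp walks the same block list, the inner loop has no data-dependent branching per lane, which is where the zero-divergence property comes from.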

Validation:

  • Full Nsight Compute profiling
  • 100-iteration benchmarks
  • Correctness verified vs cuSPARSE
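
A rough shape of that harness (a hypothetical helper, not the code in the PR): warm up, time 100 iterations with CUDA events, derive TFLOPS, and compare element-wise against a cuSPARSE reference copied back to the host.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <functional>

// `launch` wraps the kernel launch; `C` and `C_ref` are host copies of the
// kernel output and the cuSPARSE reference; `flops` is the work per
// iteration (2 * nnz_blocks * BS * BS * n for this GEMM shape).
float benchmark_and_check(const std::function<void()>& launch,
                          const float* C, const float* C_ref, size_t elems,
                          double flops) {
  for (int i = 0; i < 10; ++i) launch();   // warm-up, not timed
  cudaDeviceSynchronize();

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  for (int i = 0; i < 100; ++i) launch();  // the 100 timed iterations
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("%.1f TFLOPS\n", flops * 100 / (ms * 1e-3) / 1e12);

  float max_err = 0.0f;                    // correctness vs the reference
  for (size_t i = 0; i < elems; ++i)
    max_err = std::max(max_err, std::fabs(C[i] - C_ref[i]));
  printf("max |C - C_ref| = %g\n", max_err);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return max_err;
}
```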

Files:

  • examples/89_ada_sparse_bsr_gemm/89_ada_sparse_bsr_gemm.cu
  • examples/89_ada_sparse_bsr_gemm/CMakeLists.txt
  • examples/89_ada_sparse_bsr_gemm/README.md

Author: Brandon Dent, MD ([email protected])
License: BSD-3-Clause

bGOATnote · Nov 01 '25 04:11

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] · Dec 01 '25 05:12

Hi team — this PR is still active. I’m happy to make any modifications needed for merge. Please let me know if you'd like any specific benchmarking, correctness tests, or alignment with CUTLASS 4.3.0 conventions. This kernel has been validated on L4 (SM89) with Nsight Compute and cuSPARSE baselines. Thanks!

bGOATnote · Dec 01 '25 06:12