Add dual-GEMM examples for SM90 (Hopper) and SM120 (Blackwell)
Summary
Implements dual-GEMM examples for SM90 (Hopper) and SM120 (Blackwell) using CUTLASS 3.x.
The dual-GEMM operation implemented is:
D0 = epilogue0(X @ B0, C0)
D1 = epilogue1(X @ B1, C1)
D2 = element_wise(D0, D1)
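As a point of comparison for the fused kernel, the three equations above can be written as a naive host-side reference. This is a minimal sketch and not code from the PR: the names `Matrix`, `matmul`, `silu`, and `dual_gemm_reference` are hypothetical, both epilogues are assumed to be linear combinations of the form `alpha * acc + beta * C`, the element-wise step is assumed to be SiLU on `D0` multiplied by `D1`, and all matrices are stored row-major for simplicity.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference; none of these names are CUTLASS API.
using Matrix = std::vector<float>;

// acc = X @ B, with X stored as M x K row-major and B as K x N row-major.
inline Matrix matmul(const Matrix& X, const Matrix& B,
                     std::size_t M, std::size_t N, std::size_t K) {
  Matrix acc(M * N, 0.0f);
  for (std::size_t i = 0; i < M; ++i)
    for (std::size_t k = 0; k < K; ++k)
      for (std::size_t j = 0; j < N; ++j)
        acc[i * N + j] += X[i * K + k] * B[k * N + j];
  return acc;
}

// SiLU(x) = x * sigmoid(x).
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Assumption: both epilogues are alpha * acc + beta * C, and the
// element-wise combine is silu(D0) * D1.
inline Matrix dual_gemm_reference(const Matrix& X,
                                  const Matrix& B0, const Matrix& B1,
                                  const Matrix& C0, const Matrix& C1,
                                  std::size_t M, std::size_t N, std::size_t K,
                                  float alpha0, float beta0,
                                  float alpha1, float beta1) {
  Matrix acc0 = matmul(X, B0, M, N, K);
  Matrix acc1 = matmul(X, B1, M, N, K);
  Matrix D2(M * N);
  for (std::size_t idx = 0; idx < M * N; ++idx) {
    // D0 and D1 exist only as temporaries here; only D2 is returned.
    float d0 = alpha0 * acc0[idx] + beta0 * C0[idx];
    float d1 = alpha1 * acc1[idx] + beta1 * C1[idx];
    D2[idx] = silu(d0) * d1;
  }
  return D2;
}
```

A reference like this is only practical for small problem sizes, but it gives a ground truth to compare the device output against.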
Implementation details
- Based on the single-GEMM examples 48_hopper_warp_specialized_gemm.cu and 79a_blackwell_geforce_nvfp4_bf16_gemm.cu
- B0 and B1 layouts are not decoupled, but both are passed separately to the builders for potential future flexibility. (Blackwell supports only TN layout; Hopper assumes NK layout for make_tma_copy_B_sm90 etc.)
- D2 performs LeftSiLUAndMul, similar to example 45_dual_gemm, implemented in collective/sm90_epilogue_tma_warpspecialized_dual.hpp store()
- D0 and D1 are intermediate results only and are not stored.
- Added template<class Op0, class Op1> in fusion/sm90_callbacks… to allow distinct operations for D0 and D1.
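The idea of parameterizing the epilogue on two independent operations can be illustrated with a small standalone sketch. This is not the PR's callback code and does not use any CUTLASS types: `LinearCombinationSketch`, `ScaleOnlySketch`, `LeftSiLUAndMulSketch`, and `DualEpilogueSketch` are hypothetical names, and the sketch works on single scalars rather than register fragments.

```cpp
#include <cmath>

// Hypothetical per-element functors, for illustration only.
struct LinearCombinationSketch {
  float alpha, beta;
  float operator()(float acc, float c) const { return alpha * acc + beta * c; }
};

struct ScaleOnlySketch {
  float alpha;
  float operator()(float acc, float /*c*/) const { return alpha * acc; }
};

// Assumed combine step: SiLU on the left operand, then multiply by the right.
struct LeftSiLUAndMulSketch {
  float operator()(float lhs, float rhs) const {
    float silu = lhs / (1.0f + std::exp(-lhs));
    return silu * rhs;
  }
};

// Op0 and Op1 may be different types, so D0 and D1 can use distinct operations.
template <class Op0, class Op1, class Combine = LeftSiLUAndMulSketch>
struct DualEpilogueSketch {
  Op0 op0;
  Op1 op1;
  Combine combine;

  float operator()(float acc0, float c0, float acc1, float c1) const {
    float d0 = op0(acc0, c0);  // intermediate, never written out
    float d1 = op1(acc1, c1);  // intermediate, never written out
    return combine(d0, d1);    // only this value would be stored as D2
  }
};
```

For example, `DualEpilogueSketch<LinearCombinationSketch, ScaleOnlySketch> epi{{1.0f, 0.5f}, {2.0f}, {}};` applies a linear combination to the first accumulator and a plain scale to the second before combining them.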
Performance (all configurations kept the same as in the single-GEMM examples)
SM90 (Hopper)
- Problem size: 2048×2048×2048
- Rasterization: Heuristic with max CTA swizzle 2
- Avg runtime: 0.20429 ms
- GFLOPS: 168,191
- ≈5% faster than the baseline of two separate single-GEMM runs
SM120 (Blackwell)
- Problem size: 2048×2048×2048
- Avg runtime: 0.155648 ms
- GFLOPS: 220,753
- ≈30% slower than the baseline of two separate single-GEMM runs (I haven’t been able to find the root cause yet)
Notes
- I am relatively new to CUTLASS C++; this work was implemented as a learning exercise. I followed an example structure similar to 63_hopper_gemm_with_weight_prefetch.
- The SM120 example was an initial local starting point and can be removed if unnecessary.
Closes #1123
@hwu36 @mnicely Hi, just checking whether 3.x dual-gemm is still planned, and if there’s any chance this PR might get reviewed later if time allows? I’d appreciate any feedback on whether I’m on the right track. Thanks!
@ANIKET-SHIVAM, @IonThruster, @depaulmillz could you please take a look?
This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.