cutlass [QST] Hopper grouped GEMM performance does not meet expectations in some cases

I ran the example 57_hopper_grouped_gemm with different options and found that the performance degrades when beta != 0.

For example, with --beta=0 the runtime is about 0.5 ms:

  ./examples/57_hopper_grouped_gemm/57_hopper_grouped_gemm --m=5120 --n=1280 --k=256 --groups=32 --beta=0

  Groups      : 32
  Avg runtime : 0.528832 ms
  GFLOPS      : 203040

However, with --beta=1 and otherwise identical options, the runtime increases to about 2.2 ms:

  ./examples/57_hopper_grouped_gemm/57_hopper_grouped_gemm --m=5120 --n=1280 --k=256 --groups=32 --beta=1

  Groups      : 32
  Avg runtime : 2.24327 ms
  GFLOPS      : 47865

If I am not mistaken, the extra time (about 1.7 ms) far exceeds the cost of loading the C tensor.
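As a rough sanity check on that claim, here is a back-of-the-envelope sketch in Python. It reproduces the reported GFLOPS from the problem sizes and then estimates how long reading C once from global memory would take. The 2-byte C element size and the 3350 GB/s peak memory bandwidth are my assumptions (not taken from the example's output), and the helper names are made up for illustration.

```python
def gemm_flops(m, n, k, groups):
    """Total floating-point operations: 2*m*n*k per group (multiply + add)."""
    return 2 * m * n * k * groups


def gflops(m, n, k, groups, runtime_ms):
    """Achieved throughput in GFLOPS for a measured runtime in milliseconds."""
    return gemm_flops(m, n, k, groups) / (runtime_ms * 1e-3) / 1e9


def c_load_ms(m, n, groups, bytes_per_elem, bandwidth_gb_s):
    """Ideal time (ms) to read every C matrix once at the given bandwidth."""
    total_bytes = m * n * groups * bytes_per_elem
    return total_bytes / (bandwidth_gb_s * 1e9) * 1e3


M, N, K, GROUPS = 5120, 1280, 256, 32

# Measured runtimes from the two runs above.
RUNTIME_BETA0_MS = 0.528832
RUNTIME_BETA1_MS = 2.24327

# ASSUMPTIONS: C elements are 2 bytes wide, and peak memory bandwidth
# is 3350 GB/s. Adjust both to match the actual build and device.
C_BYTES_PER_ELEM = 2
PEAK_BANDWIDTH_GB_S = 3350

extra_ms = RUNTIME_BETA1_MS - RUNTIME_BETA0_MS
ideal_c_ms = c_load_ms(M, N, GROUPS, C_BYTES_PER_ELEM, PEAK_BANDWIDTH_GB_S)

print(f"GFLOPS (beta=0): {gflops(M, N, K, GROUPS, RUNTIME_BETA0_MS):.0f}")
print(f"GFLOPS (beta=1): {gflops(M, N, K, GROUPS, RUNTIME_BETA1_MS):.0f}")
print(f"Extra time with beta=1: {extra_ms:.3f} ms")
print(f"Ideal time to load C:   {ideal_c_ms:.3f} ms")
```

Under these assumptions, reading C at peak bandwidth would take roughly 0.13 ms, which is more than ten times smaller than the 1.7 ms of extra runtime observed.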

Feb 18 '24 09:02 AndySong20

@ANIKET-SHIVAM

Feb 18 '24 16:02 hwu36

These are small-k cases, and the current NoSmem epilogues are not optimized for them. We plan to add TMA-based epilogue support for Grouped GEMM soon, which should improve these cases.

Feb 21 '24 17:02 ANIKET-SHIVAM