[QST] Hopper grouped GEMM performance falls short of expectations in some cases
I ran the example 57_hopper_grouped_gemm with different options and found that the performance degrades when beta != 0.
For example, if you run the following command
./examples/57_hopper_grouped_gemm/57_hopper_grouped_gemm --m=5120 --n=1280 --k=256 --groups=32 --beta=0
the runtime is about 0.5 ms:
Groups : 32
Avg runtime : 0.528832 ms
GFLOPS : 203040
However, when executing the command
./examples/57_hopper_grouped_gemm/57_hopper_grouped_gemm --m=5120 --n=1280 --k=256 --groups=32 --beta=1
the runtime increases to about 2.2 ms:
Groups : 32
Avg runtime : 2.24327 ms
GFLOPS : 47865
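As a sanity check on the two logs, the reported GFLOPS can be recomputed from the problem shape and the measured runtimes. This is a minimal sketch that assumes every group has the same m x n x k shape and that one GEMM costs 2*m*n*k floating-point operations; the helper name `gflops` is made up for illustration and is not part of the example.

```python
# Recompute GFLOPS from the command-line shape and the measured runtimes.
# ASSUMPTION: all 32 groups share the same m x n x k shape, and a GEMM
# costs 2 * m * n * k FLOPs (one multiply and one add per inner-product term).
m, n, k, groups = 5120, 1280, 256, 32


def gflops(runtime_ms: float) -> float:
    """Hypothetical helper: total FLOPs divided by runtime, in GFLOP/s."""
    total_flops = 2 * m * n * k * groups
    return total_flops / (runtime_ms * 1e-3) / 1e9


print(f"beta=0: {gflops(0.528832):.0f} GFLOPS")
print(f"beta=1: {gflops(2.24327):.0f} GFLOPS")
```

Under these assumptions the results agree with the reported figures to within rounding, so the roughly 4x drop in throughput comes entirely from the longer runtime; the FLOP count is the same in both runs.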
If I am not mistaken, the added time far exceeds the cost of loading the C tensor.
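A rough back-of-the-envelope estimate supports this. The sketch below assumes C is stored as a 16-bit type (2 bytes per element) and that global memory sustains about 3 TB/s; both numbers are assumptions for illustration, not values taken from the example's output.

```python
# Lower-bound estimate of the time needed to read every C tensor once.
# ASSUMPTION: C elements are 16-bit (2 bytes each).
# ASSUMPTION: roughly 3 TB/s of sustained global-memory bandwidth.
ELEM_BYTES_C = 2
BANDWIDTH_BYTES_PER_S = 3.0e12

m, n, groups = 5120, 1280, 32

c_bytes = m * n * groups * ELEM_BYTES_C
c_load_ms = c_bytes / BANDWIDTH_BYTES_PER_S * 1e3

# Extra runtime observed when switching from beta=0 to beta=1 (from the logs).
extra_ms = 2.24327 - 0.528832

print(f"C size: {c_bytes / 1e6:.1f} MB")
print(f"estimated C load time: {c_load_ms:.3f} ms")
print(f"observed extra time:   {extra_ms:.3f} ms")
print(f"ratio: {extra_ms / c_load_ms:.1f}x")
```

With these assumed numbers, reading C should cost on the order of 0.1 ms, while the measured slowdown is about 1.7 ms, more than ten times larger.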
@ANIKET-SHIVAM
These are small-k cases, and the current NoSmem epilogues are not optimized for them. We plan to add TMA-based epilogue support for Grouped GEMM soon, which should improve these cases.
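To illustrate why small k makes the epilogue matter so much, the sketch below compares the minimum global-memory traffic of the epilogue (read C, write D) with that of the mainloop (read A and B) for one group. It assumes 1-byte A/B elements and 2-byte C/D elements, and that each operand is moved exactly once; the function `traffic_ratio` is hypothetical and only models byte counts, not how the actual kernel behaves.

```python
# Ratio of epilogue bytes to mainloop bytes for a single m x n x k GEMM.
# ASSUMPTION: A and B are 1 byte per element, C and D are 2 bytes per element.
# ASSUMPTION: each operand is moved through global memory exactly once.
def traffic_ratio(m, n, k, bytes_ab=1, bytes_cd=2, beta_nonzero=True):
    mainloop_bytes = (m * k + n * k) * bytes_ab
    # With beta != 0 the epilogue reads C and writes D; otherwise it only writes D.
    passes = 2 if beta_nonzero else 1
    epilogue_bytes = m * n * bytes_cd * passes
    return epilogue_bytes / mainloop_bytes


for k in (256, 1024, 4096):
    print(f"k={k}: epilogue/mainloop bytes = {traffic_ratio(5120, 1280, k):.1f}")
```

In this simplified model the epilogue moves many times more data than the mainloop at k=256, and the ratio shrinks in proportion as k grows, so any inefficiency in the epilogue's memory access is most visible for small k.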