
[QST] The performance of Hopper group gemm is not meeting expectation in some cases

Open AndySong20 opened this issue 1 year ago • 1 comments

I ran the example 57_hopper_grouped_gemm with different options and found that the performance degrades when beta != 0.

For example, running the following command ./examples/57_hopper_grouped_gemm/57_hopper_grouped_gemm --m=5120 --n=1280 --k=256 --groups=32 --beta=0 gives a runtime of about 0.5 ms.

  Groups      : 32
  Avg runtime : 0.528832 ms
  GFLOPS      : 203040

However, running the same command with --beta=1, ./examples/57_hopper_grouped_gemm/57_hopper_grouped_gemm --m=5120 --n=1280 --k=256 --groups=32 --beta=1 increases the runtime to about 2.2 ms.

  Groups      : 32
  Avg runtime : 2.24327 ms
  GFLOPS      : 47865

If I am not mistaken, the extra time (about 1.7 ms) far exceeds the cost of loading the C tensor.
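
As a rough sanity check of that claim, here is a back-of-the-envelope estimate in Python. It reproduces the reported GFLOPS from the problem size and the two runtimes, then compares the extra time against an idealized cost of reading C once. Two inputs are assumptions and not stated in this thread: that C elements are 2 bytes (half precision), and that the GPU sustains roughly 3350 GB/s of memory bandwidth. Adjust both to match your build and hardware.

```python
def gemm_flops(m, n, k, groups):
    """Total floating-point operations for `groups` GEMMs of size m x n x k."""
    return 2 * m * n * k * groups


def c_load_time_ms(m, n, groups, bytes_per_elem, bandwidth_gb_s):
    """Idealized time to read every C tensor once at full memory bandwidth."""
    bytes_c = m * n * groups * bytes_per_elem
    return bytes_c / (bandwidth_gb_s * 1e9) * 1e3


# Problem size and runtimes reported above.
m, n, k, groups = 5120, 1280, 256, 32
t_beta0_ms = 0.528832
t_beta1_ms = 2.24327

# ASSUMPTIONS (not from the thread): 2-byte C elements, ~3350 GB/s bandwidth.
BYTES_PER_C_ELEM = 2
BANDWIDTH_GB_S = 3350.0

flops = gemm_flops(m, n, k, groups)
gflops_beta0 = flops / (t_beta0_ms * 1e-3) / 1e9
gflops_beta1 = flops / (t_beta1_ms * 1e-3) / 1e9

extra_ms = t_beta1_ms - t_beta0_ms
c_load_ms = c_load_time_ms(m, n, groups, BYTES_PER_C_ELEM, BANDWIDTH_GB_S)

print(f"GFLOPS (beta=0)      : {gflops_beta0:.0f}")
print(f"GFLOPS (beta=1)      : {gflops_beta1:.0f}")
print(f"Extra time (beta=1)  : {extra_ms:.3f} ms")
print(f"Ideal C load time    : {c_load_ms:.3f} ms")
print(f"Extra / ideal C load : {extra_ms / c_load_ms:.1f}x")
```

Under these assumptions, the extra time is more than ten times the idealized C load time, which is consistent with the slowdown not being explained by the additional memory traffic alone.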

AndySong20 avatar Feb 18 '24 09:02 AndySong20

@ANIKET-SHIVAM

hwu36 avatar Feb 18 '24 16:02 hwu36

These are small-k cases, and the current NoSmem epilogues are not optimized for them. We plan to add TMA-based epilogue support for Grouped GEMM soon, which should improve these cases.

ANIKET-SHIVAM avatar Feb 21 '24 17:02 ANIKET-SHIVAM