[BUG] Batched GEMM execution via cutlass_profiler prints verbose debug output
Describe the bug: When using cutlass_profiler to profile batched GEMM operations, the command prints a large amount of internal file/line trace output instead of the normal profiling results.
Steps/Code to reproduce bug
- build cutlass_profiler
- execute the command: `./build/tools/profiler/cutlass_profiler --operation=Gemm --gemm_kind=batched --batch_count=512 --m=32 --n=32 --k=64`
- the terminal prints lots of code info like the following:

```
root@alilab-sv02:/home/cutlass# ./build/tools/profiler/cutlass_profiler --operation=Gemm --gemm_kind=batched --batch_count=512 --m=32 --n=32 --k=64
/home/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:386 GemmUniversal::can_implement()
/home/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:456 returning kSuccess
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:157 GemmUniversalBase::get_workspace_size()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:182 workspace_bytes: 0
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:283 GemmUniversalBase::initialize() - workspace 0, stream: null
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:157 GemmUniversalBase::get_workspace_size()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:182 workspace_bytes: 0
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:288 workspace_bytes: 0
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:343 GemmUniversalBase()::update() - workspace: 0
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:157 GemmUniversalBase::get_workspace_size()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:182 workspace_bytes: 0
/home/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:363 GemmUniversal::Params::update()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:358 GemmUniversalBase::run()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:375 grid: (1, 1, 512), block: (256, 1, 1), SMEM: 49152 bytes
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:343 GemmUniversalBase()::update() - workspace: 0
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:157 GemmUniversalBase::get_workspace_size()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:182 workspace_bytes: 0
/home/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:363 GemmUniversal::Params::update()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:358 GemmUniversalBase::run()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:375 grid: (1, 1, 512), block: (256, 1, 1), SMEM: 49152 bytes
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:343 GemmUniversalBase()::update() - workspace: 0
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:157 GemmUniversalBase::get_workspace_size()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:182 workspace_bytes: 0
/home/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:363 GemmUniversal::Params::update()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:358 GemmUniversalBase::run()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:375 grid: (1, 1, 512), block: (256, 1, 1), SMEM: 49152 bytes
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:343 GemmUniversalBase()::update() - workspace: 0
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:157 GemmUniversalBase::get_workspace_size()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:182 workspace_bytes: 0
/home/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:363 GemmUniversal::Params::update()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:358 GemmUniversalBase::run()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:375 grid: (1, 1, 512), block: (256, 1, 1), SMEM: 49152 bytes
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:343 GemmUniversalBase()::update() - workspace: 0
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:157 GemmUniversalBase::get_workspace_size()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:182 workspace_bytes: 0
/home/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:363 GemmUniversal::Params::update()
```
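These lines look like CUTLASS host-side tracing rather than profiler output. As a hedged guess (assuming the build picked up a nonzero `CUTLASS_DEBUG_TRACE_LEVEL`, the compile-time setting that gates these trace messages), reconfiguring with the trace level set back to 0 should silence them:

```bash
# Sketch, not a confirmed fix: rebuild with host-side tracing disabled.
# CUTLASS_DEBUG_TRACE_LEVEL=0 is assumed to be the default (quiet) setting.
cd build
cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_DEBUG_TRACE_LEVEL=0
make cutlass_profiler -j16
```

If the option is not exposed as a CMake cache variable in your checkout, the equivalent compiler definition would need to be removed from the build flags instead.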
What is your cmake command?
I just followed the README guide:

```
cmake .. -DCUTLASS_NVCC_ARCHS=80
make cutlass_profiler -j16
```
What type of batch are you interested in? Data types, layouts, architectures, etc.
I'd like to know the performance of the batched matmul ops used in typical transformer models; the target GPU is A100. W.r.t. data types and layouts, I think fp32 (or using Tensor Cores) and a normal NCHW or NHWC layout is preferred. E.g. some shapes are as follows:

| b   | m  | n  | k  |
|-----|----|----|----|
| 512 | 32 | 32 | 64 |
| 512 | 32 | 64 | 32 |
| 512 | 36 | 36 | 64 |
| 512 | 36 | 64 | 36 |
| 512 | 36 | 32 | 64 |
| 512 | 36 | 64 | 32 |
> normal NCHW or NHWC is preferred
Gemm works on 2D data. Do you want row major or column major for each of A, B, C in C = A x B?
BTW, if you want to run batched GEMM for a transformer model, grouped GEMM may be more useful to you. Check https://github.com/NVIDIA/cutlass/tree/master/examples/24_gemm_grouped . Grouped GEMM is not runnable in the profiler; you need to use that example to profile it.
> 2D data.

Sorry for the confusion; I meant that any normal layout is OK. It's not limited to row major or column major. I'm just curious about the peak performance that CUTLASS is able to achieve for this type of op.
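If any layout is acceptable, the profiler can be asked to sweep a specific data type and layout per operand via its tensor specifiers. A sketch (assuming the profiler's `--A=f32:column` style operand syntax; adjust `column`/`row` as needed):

```bash
# Sketch: profile the batched fp32 GEMM with explicit per-operand layouts.
./build/tools/profiler/cutlass_profiler --operation=Gemm --gemm_kind=batched \
  --batch_count=512 --m=32 --n=32 --k=64 \
  --A=f32:column --B=f32:column --C=f32:column
```

Dropping the `--A/--B/--C` flags lets the profiler run all matching layout combinations, which is one way to find the peak-performing variant.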
OK, I will check this. BTW, this bug is still there, and I think it should be fixed somehow :)
examples/24_gemm_grouped failed to execute, and I got the following error:
```
/home/cutlass# ./build/examples/24_gemm_grouped/24_gemm_grouped --groups=512 --m=36 --n=36 --k=64
/home/cutlass/include/cutlass/gemm/device/gemm_grouped.h:211 GemmUniversalBase::initialize() - workspace 0, stream: null
Kernel execution error: misaligned address
Profiling CUTLASS grouped GEMM has failed.
Failed
```
36 is not a multiple of 8. The kernel instantiated in the example needs M to be a multiple of 8. You can change the alignment to run M = 36, or you can set m to 32 to run the example as-is.
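The second workaround from the comment above (keeping the default alignment and choosing a supported M) can be tried directly, using the same invocation from earlier in this thread with m changed to 32:

```bash
# Keep the kernel's default alignment of 8 and pick M = 32, which is a
# multiple of 8, so the instantiated kernel runs without source changes.
./build/examples/24_gemm_grouped/24_gemm_grouped --groups=512 --m=32 --n=32 --k=64
```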
I changed the example code as follows:
```diff
diff --git a/examples/24_gemm_grouped/gemm_grouped.cu b/examples/24_gemm_grouped/gemm_grouped.cu
index cfeb1ba..123a152 100644
--- a/examples/24_gemm_grouped/gemm_grouped.cu
+++ b/examples/24_gemm_grouped/gemm_grouped.cu
@@ -162,7 +162,7 @@ struct Options {
   Options():
     help(false),
     error(false),
-    alignment(8),
+    alignment(4),
     reference_check(true),
     problem_count(15),
     iterations(20),
@@ -181,7 +181,7 @@ struct Options {
       return;
     }

-    cmd.get_cmd_line_argument("alignment", alignment, 8);
+    cmd.get_cmd_line_argument("alignment", alignment, 4);
     cmd.get_cmd_line_argument("groups", problem_count, 15);
     cmd.get_cmd_line_argument("alpha", alpha, 1.0f);
     cmd.get_cmd_line_argument("beta", beta, 0.0f);
```
But it does not work. Is there anything else that should be modified?
- alignmentA is set at https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/kernel/default_gemm_grouped.h#L73
- alignmentB is set at https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/kernel/default_gemm_grouped.h#L81
- alignmentC is set at https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/epilogue/thread/linear_combination.h#L58
Both A and C need to be changed to 4 if M is 36 and the layouts are col x col -> col.
Back to your original batched GEMM profiling problem: you can use this cmake command

```
cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=sgemm
```

to build and run just the fp32 GEMM kernels on A100.
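Putting the pieces of this thread together, a full rebuild-and-profile sequence might look like the following (the shapes are the transformer sizes listed earlier in the thread):

```bash
# Build only the fp32 (sgemm) kernels for SM80 (A100), then profile one
# of the batched shapes from this thread: b=512, m=32, n=64, k=32.
cd build
cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=sgemm
make cutlass_profiler -j16
./tools/profiler/cutlass_profiler --operation=Gemm --gemm_kind=batched \
  --batch_count=512 --m=32 --n=64 --k=32
```

Restricting `CUTLASS_LIBRARY_KERNELS` also shortens the build considerably, since only the matching kernel instantiations are compiled.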
@TUMSchieben were you able to work past your issues?
The problems are resolved after updating the repo to the latest commit.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.