[BUG] Batched GEMM execution via cutlass_profiler prints verbose debug output
Describe the bug: When using cutlass_profiler to profile batched GEMM operations, the command prints a large amount of internal file/line trace output instead of the normal profiling results.
Steps/Code to reproduce bug
- build cutlass_profiler
- execute the command: `./build/tools/profiler/cutlass_profiler --operation=Gemm --gemm_kind=batched --batch_count=512 --m=32 --n=32 --k=64`
- the terminal prints lots of code info like the following:

```
root@alilab-sv02:/home/cutlass# ./build/tools/profiler/cutlass_profiler --operation=Gemm --gemm_kind=batched --batch_count=512 --m=32 --n=32 --k=64
/home/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:386 GemmUniversal::can_implement()
/home/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:456 returning kSuccess
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:157 GemmUniversalBase::get_workspace_size()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:182 workspace_bytes: 0
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:283 GemmUniversalBase::initialize() - workspace 0, stream: null
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:157 GemmUniversalBase::get_workspace_size()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:182 workspace_bytes: 0
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:288 workspace_bytes: 0
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:343 GemmUniversalBase()::update() - workspace: 0
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:157 GemmUniversalBase::get_workspace_size()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:182 workspace_bytes: 0
/home/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:363 GemmUniversal::Params::update()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:358 GemmUniversalBase::run()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:375 grid: (1, 1, 512), block: (256, 1, 1), SMEM: 49152 bytes
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:343 GemmUniversalBase()::update() - workspace: 0
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:157 GemmUniversalBase::get_workspace_size()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:182 workspace_bytes: 0
/home/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:363 GemmUniversal::Params::update()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:358 GemmUniversalBase::run()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:375 grid: (1, 1, 512), block: (256, 1, 1), SMEM: 49152 bytes
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:343 GemmUniversalBase()::update() - workspace: 0
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:157 GemmUniversalBase::get_workspace_size()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:182 workspace_bytes: 0
/home/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:363 GemmUniversal::Params::update()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:358 GemmUniversalBase::run()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:375 grid: (1, 1, 512), block: (256, 1, 1), SMEM: 49152 bytes
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:343 GemmUniversalBase()::update() - workspace: 0
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:157 GemmUniversalBase::get_workspace_size()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:182 workspace_bytes: 0
/home/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:363 GemmUniversal::Params::update()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:358 GemmUniversalBase::run()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:375 grid: (1, 1, 512), block: (256, 1, 1), SMEM: 49152 bytes
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:343 GemmUniversalBase()::update() - workspace: 0
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:157 GemmUniversalBase::get_workspace_size()
/home/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:182 workspace_bytes: 0
/home/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:363 GemmUniversal::Params::update()
```
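These lines look like CUTLASS host-side tracing rather than profiler output. As a hedged guess (assuming the build picked up a nonzero `CUTLASS_DEBUG_TRACE_LEVEL`, the compile-time setting that gates these trace messages), reconfiguring with the trace level set back to 0 should silence them:

```bash
# Sketch, not a confirmed fix: rebuild with host-side tracing disabled.
# CUTLASS_DEBUG_TRACE_LEVEL=0 is assumed to be the default (quiet) setting.
cd build
cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_DEBUG_TRACE_LEVEL=0
make cutlass_profiler -j16
```

If the option is not exposed as a CMake cache variable in your checkout, the equivalent compiler definition would need to be removed from the build flags instead.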
What is your cmake command?
I just followed the README guide:

```
cmake .. -DCUTLASS_NVCC_ARCHS=80
make cutlass_profiler -j16
```
What type of batch are you interested in? Data types, layouts, architectures, etc.
I'd like to know the performance of the batched matmul ops used in typical transformer models; the target GPU is A100. W.r.t. data types and layouts, I think fp32 (or using Tensor Cores) and a normal NCHW or NHWC layout is preferred. E.g. some shapes are as follows:

| b   | m  | n  | k  |
|-----|----|----|----|
| 512 | 32 | 32 | 64 |
| 512 | 32 | 64 | 32 |
| 512 | 36 | 36 | 64 |
| 512 | 36 | 64 | 36 |
| 512 | 36 | 32 | 64 |
| 512 | 36 | 64 | 32 |
> normal NCHW or NHWC is preferred
Gemm works on 2D data. Do you want row major or column major for each of A, B, C in C = A x B?
BTW, if you want to run batched GEMM for a transformer model, grouped GEMM may be more useful to you. Check https://github.com/NVIDIA/cutlass/tree/master/examples/24_gemm_grouped . Grouped GEMM is not runnable in the profiler; you need to use that example to profile it.
> 2D data.

Sorry for the confusion; I meant that any normal layout is OK. It's not limited to row major or column major. I'm just curious about the peak performance that CUTLASS is able to achieve for this type of op.
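If any layout is acceptable, the profiler can be asked to sweep a specific data type and layout per operand via its tensor specifiers. A sketch (assuming the profiler's `--A=f32:column` style operand syntax; adjust `column`/`row` as needed):

```bash
# Sketch: profile the batched fp32 GEMM with explicit per-operand layouts.
./build/tools/profiler/cutlass_profiler --operation=Gemm --gemm_kind=batched \
  --batch_count=512 --m=32 --n=32 --k=64 \
  --A=f32:column --B=f32:column --C=f32:column
```

Dropping the `--A/--B/--C` flags lets the profiler run all matching layout combinations, which is one way to find the peak-performing variant.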
OK, I will check this. BTW, this bug is still there, and I think it should be fixed somehow :)
examples/24_gemm_grouped failed to execute, and I got the following error:
```
/home/cutlass# ./build/examples/24_gemm_grouped/24_gemm_grouped --groups=512 --m=36 --n=36 --k=64
/home/cutlass/include/cutlass/gemm/device/gemm_grouped.h:211 GemmUniversalBase::initialize() - workspace 0, stream: null
Kernel execution error: misaligned address
Profiling CUTLASS grouped GEMM has failed.
Failed
```
36 is not a multiple of 8. The kernel instantiated in the example needs M to be a multiple of 8. You can change the alignment to run M = 36, or you can set m to 32 to run the example as-is.
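The second workaround from the comment above (keeping the default alignment and choosing a supported M) can be tried directly, using the same invocation from earlier in this thread with m changed to 32:

```bash
# Keep the kernel's default alignment of 8 and pick M = 32, which is a
# multiple of 8, so the instantiated kernel runs without source changes.
./build/examples/24_gemm_grouped/24_gemm_grouped --groups=512 --m=32 --n=32 --k=64
```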
I changed the example code as follows:
```diff
diff --git a/examples/24_gemm_grouped/gemm_grouped.cu b/examples/24_gemm_grouped/gemm_grouped.cu
index cfeb1ba..123a152 100644
--- a/examples/24_gemm_grouped/gemm_grouped.cu
+++ b/examples/24_gemm_grouped/gemm_grouped.cu
@@ -162,7 +162,7 @@ struct Options {
   Options():
     help(false),
     error(false),
-    alignment(8),
+    alignment(4),
     reference_check(true),
     problem_count(15),
     iterations(20),
@@ -181,7 +181,7 @@ struct Options {
       return;
     }

-    cmd.get_cmd_line_argument("alignment", alignment, 8);
+    cmd.get_cmd_line_argument("alignment", alignment, 4);
     cmd.get_cmd_line_argument("groups", problem_count, 15);
     cmd.get_cmd_line_argument("alpha", alpha, 1.0f);
     cmd.get_cmd_line_argument("beta", beta, 0.0f);
```
But it does not work. Is there anything else that should be modified?
- alignmentA is set at https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/kernel/default_gemm_grouped.h#L73
- alignmentB is set at https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/kernel/default_gemm_grouped.h#L81
- alignmentC is set at https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/epilogue/thread/linear_combination.h#L58
Both A and C need to be changed to 4 if M is 36 and the layouts are col x col -> col.
Back to your original batched GEMM profiling problem: you can use this cmake command

```
cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=sgemm
```

to build and run just the fp32 GEMM kernels on A100.
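Putting the pieces of this thread together, a full rebuild-and-profile sequence might look like the following (the shapes are the transformer sizes listed earlier in the thread):

```bash
# Build only the fp32 (sgemm) kernels for SM80 (A100), then profile one
# of the batched shapes from this thread: b=512, m=32, n=64, k=32.
cd build
cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=sgemm
make cutlass_profiler -j16
./tools/profiler/cutlass_profiler --operation=Gemm --gemm_kind=batched \
  --batch_count=512 --m=32 --n=64 --k=32
```

Restricting `CUTLASS_LIBRARY_KERNELS` also shortens the build considerably, since only the matching kernel instantiations are compiled.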
@TUMSchieben were you able to work past your issues?
The problems are resolved after updating the repo to the latest commit.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.