cutlass [QST] Legal threadblock, warp, and MMA shapes

What is your question? I am trying to use CUTLASS on the Ampere architecture to multiply two rectangular matrices, MxK and KxN, where M and N are small (say 16) and K is very large (say 16777216). As a starting point I am using the examples included in the CUTLASS distribution, in particular splitk_gemm.cu (example 06). The threadblock, warp, and MMA shapes are originally defined as follows:

// This code section describes the tile size a thread block will compute
using ShapeMMAThreadBlock = cutlass::gemm::GemmShape<128, 128, 32>;
// This code section describes tile size a warp will compute
using ShapeMMAWarp = cutlass::gemm::GemmShape<64, 64, 32>;
// This code section describes the size of MMA op
using ShapeMMAOp = cutlass::gemm::GemmShape<8, 8, 4>;
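To put numbers on how these tiles relate to the problem size mentioned above (M = N = 16, K = 16777216), here is a rough back-of-the-envelope sketch. It is plain C++ and does not use CUTLASS; `ceil_div` and `padded_elements` are hypothetical helper names introduced only for this illustration, and the sketch only counts tiles, it does not model performance.

```cpp
#include <cstdint>

// Hypothetical helper (not part of CUTLASS): tiles needed to cover `extent`.
constexpr std::int64_t ceil_div(std::int64_t extent, std::int64_t tile) {
  return (extent + tile - 1) / tile;
}

// Hypothetical helper: output elements the threadblock tiles cover in M x N,
// including the padding beyond the real M x N output.
constexpr std::int64_t padded_elements(std::int64_t m, std::int64_t n,
                                       std::int64_t tile_m, std::int64_t tile_n) {
  return ceil_div(m, tile_m) * tile_m * ceil_div(n, tile_n) * tile_n;
}

// Problem size quoted in the question.
constexpr std::int64_t kM = 16;
constexpr std::int64_t kN = 16;
constexpr std::int64_t kK = 16777216;

// A single 128x128 tile already covers the whole 16x16 output.
static_assert(ceil_div(kM, 128) * ceil_div(kN, 128) == 1,
              "one threadblock tile covers the M x N output");

// That tile spans 16384 output positions for 256 real ones (a factor of 64).
static_assert(padded_elements(kM, kN, 128, 128) == 16384, "128x128 tile area");
static_assert(padded_elements(kM, kN, 128, 128) / (kM * kN) == 64,
              "64 tile positions per real output element");

// With a K tile of 32, the full K extent takes 524288 steps in total,
// which split-K divides among its slices.
static_assert(kK / 32 == 524288, "K steps for a K tile of 32");
```

The file compiles on its own with any C++11 compiler; the `static_assert` lines are the check. Smaller tiles reduce the padding (a 64x64 tile covers 4096 positions, a 32x32 tile 1024), which is the motivation for trying other shapes.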

If I understand correctly, this is not optimal for my case, so I tried changing the threadblock and warp shapes to:

using ShapeMMAThreadBlock = cutlass::gemm::GemmShape<64, 64, 16>;
// This code section describes tile size a warp will compute
using ShapeMMAWarp = cutlass::gemm::GemmShape<32, 32, 16>;
// This code section describes the size of MMA op
using ShapeMMAOp = cutlass::gemm::GemmShape<8, 8, 4>;

This compiles. However, every other combination I have tried fails at compile time. For instance,

using ShapeMMAThreadBlock = cutlass::gemm::GemmShape<32, 32, 16>;
// This code section describes tile size a warp will compute
using ShapeMMAWarp = cutlass::gemm::GemmShape<16, 16, 16>;
// This code section describes the size of MMA op
using ShapeMMAOp = cutlass::gemm::GemmShape<8, 8, 4>;

fails with errors like

"ShapeInAccesses must be divisible by WarpThreadArrangement."
"Number of iterations must be non-zero"
"Shape must be a multiple of InterleavedTileShape."

Is there a documented set of rules that one can apply to determine which shapes are acceptable? Thanks in advance for any help! Massimo
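The error messages quoted above hint at divisibility constraints between the three shapes. Below is a minimal sketch of the kind of compile-time check that could be used to pre-screen candidates. It is not CUTLASS code: `Shape`, `divides`, `nests`, and `covers_interleaved_tile` are hypothetical stand-ins so the file compiles without the library. The nesting rule (threadblock tile divisible by warp tile, warp tile divisible by MMA shape) is a necessary condition only. The 32x32 interleaved tile size is an assumption inferred from the "InterleavedTileShape" message for the 8x8x4 MMA shape, and the other two messages (thread arrangement, iteration count) are not modeled at all.

```cpp
// Stand-in for cutlass::gemm::GemmShape so the sketch compiles without CUTLASS.
template <int M, int N, int K>
struct Shape {
  static constexpr int kM = M;
  static constexpr int kN = N;
  static constexpr int kK = K;
};

// Hypothetical helper: true when each dimension of Outer is a whole multiple
// of the matching dimension of Inner.
template <typename Outer, typename Inner>
constexpr bool divides() {
  return Outer::kM % Inner::kM == 0 &&
         Outer::kN % Inner::kN == 0 &&
         Outer::kK % Inner::kK == 0;
}

// Necessary (not sufficient) nesting: threadblock -> warp -> MMA instruction.
template <typename Threadblock, typename Warp, typename Instruction>
constexpr bool nests() {
  return divides<Threadblock, Warp>() && divides<Warp, Instruction>();
}

// ASSUMPTION: the warp tile must be a multiple of a 32x32 interleaved tile in
// M and N. The 32x32 default is inferred from the error text, not confirmed.
template <typename Warp, int TileM = 32, int TileN = 32>
constexpr bool covers_interleaved_tile() {
  return Warp::kM % TileM == 0 && Warp::kN % TileN == 0;
}

using Mma = Shape<8, 8, 4>;

// Original example shapes: pass both checks.
static_assert(nests<Shape<128, 128, 32>, Shape<64, 64, 32>, Mma>(), "nesting");
static_assert(covers_interleaved_tile<Shape<64, 64, 32>>(), "interleaved tile");

// 64x64x16 / 32x32x16 (compiles in CUTLASS): pass both checks.
static_assert(nests<Shape<64, 64, 16>, Shape<32, 32, 16>, Mma>(), "nesting");
static_assert(covers_interleaved_tile<Shape<32, 32, 16>>(), "interleaved tile");

// 32x32x16 / 16x16x16 (fails in CUTLASS): nests correctly, but the 16x16 warp
// tile is smaller than the assumed 32x32 interleaved tile.
static_assert(nests<Shape<32, 32, 16>, Shape<16, 16, 16>, Mma>(), "nesting");
static_assert(!covers_interleaved_tile<Shape<16, 16, 16>>(), "too small");
```

Under these assumptions the third configuration is rejected by the interleaved-tile check alone, which is consistent with one of the reported errors. A shape passing both checks is still not guaranteed to compile in CUTLASS.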

Oct 26 '25 17:10 mbernaschi

I found that the include file cutlass/gemm/device/default_gemm_configuration.h contains a set of possible configurations, including the one I mentioned in my original post, that is:

using ShapeMMAThreadBlock = cutlass::gemm::GemmShape<64, 64, 16>;
// This code section describes tile size a warp will compute
using ShapeMMAWarp = cutlass::gemm::GemmShape<32, 32, 16>;
// This code section describes the size of MMA op
using ShapeMMAOp = cutlass::gemm::GemmShape<8, 8, 4>;

I am under the impression that ANY configuration that is not in cutlass/gemm/device/default_gemm_configuration.h fails at compile time. Can someone please confirm this or, as I already asked, point me to the set of rules that must be satisfied for a configuration to be "acceptable"? Thanks in advance! Massimo

Oct 28 '25 09:10 mbernaschi

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

Nov 27 '25 10:11 github-actions[bot]