[QST] Is there any other legal layout in cutlass?
I see this in the example code (https://github.com/NVIDIA/cutlass/blob/main/examples/cute/tutorial/sgemm_nt_1.cu), so I wonder: are there any other legal layouts?
// Define block sizes (static)
auto bM = Int<128>{};
auto bN = Int<128>{};
auto bK = Int<  8>{};

// Define the block layouts (static)
auto sA = make_layout(make_shape(bM,bK));
auto sB = make_layout(make_shape(bN,bK));
auto sC = make_layout(make_shape(bM,bN));

// Define the thread layouts (static)
auto tA = make_layout(make_shape(Int<32>{}, Int< 8>{}));
auto tB = make_layout(make_shape(Int<32>{}, Int< 8>{}));
auto tC = make_layout(make_shape(Int<16>{}, Int<16>{}));
All layouts that satisfy the preconditions of the kernel are accepted and should produce a correct kernel:
// Preconditions
CUTE_STATIC_ASSERT(is_static<ABlockLayout>::value);
CUTE_STATIC_ASSERT(is_static<BBlockLayout>::value);
CUTE_STATIC_ASSERT(is_static<CBlockLayout>::value);
CUTE_STATIC_ASSERT(is_static<AThreadLayout>::value);
CUTE_STATIC_ASSERT(is_static<BThreadLayout>::value);
CUTE_STATIC_ASSERT(is_static<CThreadLayout>::value);
CUTE_STATIC_ASSERT_V(size(tA) == size(tC));
CUTE_STATIC_ASSERT_V(size(tB) == size(tC));
CUTE_STATIC_ASSERT_V(shape<0>(blockA) == shape<0>(blockC)); // BLK_M
CUTE_STATIC_ASSERT_V(shape<0>(blockB) == shape<1>(blockC)); // BLK_N
CUTE_STATIC_ASSERT_V(shape<1>(blockA) == shape<1>(blockB)); // BLK_K
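For example (my own illustration, not taken from the tutorial), the following static layouts also satisfy those asserts, since all three thread layouts contain 256 threads and the block shapes agree on BLK_M, BLK_N, and BLK_K. In practice, each thread layout should also evenly tile its block layout so the per-thread partitioning divides cleanly:

// A different but still legal configuration (hypothetical tile and thread shapes)
auto bM = Int< 64>{};
auto bN = Int<256>{};
auto bK = Int< 16>{};

auto sA = make_layout(make_shape(bM,bK));                      // 64x16 smem block for A
auto sB = make_layout(make_shape(bN,bK));                      // 256x16 smem block for B
auto sC = make_layout(make_shape(bM,bN));                      // 64x256 accumulator block

auto tA = make_layout(make_shape(Int<16>{}, Int<16>{}));       // 256 threads copying A
auto tB = make_layout(make_shape(Int<64>{}, Int< 4>{}));       // 256 threads copying B
auto tC = make_layout(make_shape(Int< 8>{}, Int<32>{}));       // 256 threads computing C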
The example is for an NT GEMM (m-major A and n-major B), but it will also immediately function correctly for TN input data (k-major A and k-major B):
// Define shapes (dynamic)
auto M = int(m);
auto N = int(n);
auto K = int(k);
// Define strides (mixed)
auto dA = make_stride(ldA, Int<1>{}); // k-major A
auto dB = make_stride(ldB, Int<1>{}); // k-major B
auto dC = make_stride(Int<1>{}, ldC); // m-major C
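For comparison, the NT strides in the linked tutorial look roughly like this (paraphrased from sgemm_nt_1.cu; m-major A and n-major B):

// NT strides (roughly as in the tutorial)
auto dA = make_stride(Int<1>{}, ldA);   // m-major A: stride 1 along M
auto dB = make_stride(Int<1>{}, ldB);   // n-major B: stride 1 along N
auto dC = make_stride(Int<1>{}, ldC);   // m-major C: stride 1 along M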
The GEMM will function correctly, but it will not be efficient, because the smem layouts and the thread layouts used to read the data can be better aligned with the hardware. Thus, we can optimize the GEMM by using k-major smem and k-major thread layouts, which produce better partitioning patterns and more coalesced loads from gmem:
// Define the block layouts (static)
auto sA = make_layout(make_shape(bM,bK), make_stride(bK, Int<1>{})); // k-major smem
auto sB = make_layout(make_shape(bN,bK), make_stride(bK, Int<1>{})); // k-major smem
auto sC = make_layout(make_shape(bM,bN));
// Define the thread layouts (static)
auto tA = make_layout(make_shape(Int< 8>{}, Int<32>{}), LayoutRight{}); // k-major 8x32 thr layout
auto tB = make_layout(make_shape(Int< 8>{}, Int<32>{}), LayoutRight{}); // k-major 8x32 thr layout
auto tC = make_layout(make_shape(Int<16>{}, Int<16>{})); // m-major 16x16 thr layout
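To see why the k-major thread layouts help, recall how the kernel uses them: each thread layout partitions the gmem and smem tiles across threads, roughly as follows (paraphrased from the tutorial kernel; the tensor names and the k_tile index are illustrative):

// tA assigns each thread its slice of the A tiles; with a k-major tA over
// k-major data, threads with consecutive indices touch consecutive addresses.
Tensor tAgA = local_partition(gA, tA, threadIdx.x);   // per-thread slice of the gmem A tiles
Tensor tAsA = local_partition(sA, tA, threadIdx.x);   // per-thread slice of the smem A tile

copy(tAgA(_,_,k_tile), tAsA);                         // coalesced gmem -> smem copy for this k tile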
Of course, these can all be mixed and matched to suit your application.
Even more advanced layouts of input data can be used to implement general tensor-tensor contractions with this same kernel rather than simply GEMMs. You can find existing examples of these in CUTLASS already, but I'll be updating the CuTe documentation and examples with more efficient and general practices shortly.
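As a rough sketch of what that means (my own illustration, not one of the CUTLASS examples), CuTe's hierarchical shapes let a single GEMM mode stand in for several tensor modes, so the kernel above never needs to know it is doing a contraction:

// Hypothetical: fold two tensor modes (m0, m1) into the GEMM's "M" mode.
// strideA0, strideA1, strideAk, and ptr_A are illustrative names.
auto shapeA  = make_shape(make_shape(m0, m1), k);                      // ((m0,m1), K)
auto strideA = make_stride(make_stride(strideA0, strideA1), strideAk);
auto A = make_tensor(make_gmem_ptr(ptr_A), shapeA, strideA);
// To the kernel, A still behaves like an (M, K) matrix with M = m0 * m1.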
@ziyuhuang123 is your question answered?