[QST] Is there any other legal layout in cutlass?
I see this in the example code (https://github.com/NVIDIA/cutlass/blob/main/examples/cute/tutorial/sgemm_nt_1.cu), so I wonder: are there any other legal layouts?
// Define block sizes (static)
auto bM = Int<128>{};
auto bN = Int<128>{};
auto bK = Int<  8>{};

// Define the block layouts (static)
auto sA = make_layout(make_shape(bM,bK));
auto sB = make_layout(make_shape(bN,bK));
auto sC = make_layout(make_shape(bM,bN));

// Define the thread layouts (static)
auto tA = make_layout(make_shape(Int<32>{}, Int< 8>{}));
auto tB = make_layout(make_shape(Int<32>{}, Int< 8>{}));
auto tC = make_layout(make_shape(Int<16>{}, Int<16>{}));
All layouts that satisfy the preconditions of the kernel are accepted and should produce a correct kernel:
// Preconditions
CUTE_STATIC_ASSERT(is_static<ABlockLayout>::value);
CUTE_STATIC_ASSERT(is_static<BBlockLayout>::value);
CUTE_STATIC_ASSERT(is_static<CBlockLayout>::value);
CUTE_STATIC_ASSERT(is_static<AThreadLayout>::value);
CUTE_STATIC_ASSERT(is_static<BThreadLayout>::value);
CUTE_STATIC_ASSERT(is_static<CThreadLayout>::value);
CUTE_STATIC_ASSERT_V(size(tA) == size(tC));
CUTE_STATIC_ASSERT_V(size(tB) == size(tC));
CUTE_STATIC_ASSERT_V(shape<0>(blockA) == shape<0>(blockC)); // BLK_M
CUTE_STATIC_ASSERT_V(shape<0>(blockB) == shape<1>(blockC)); // BLK_N
CUTE_STATIC_ASSERT_V(shape<1>(blockA) == shape<1>(blockB)); // BLK_K
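For example (my own illustration, not taken from the tutorial), the following static layouts also satisfy those asserts, since all three thread layouts contain 256 threads and the block shapes agree on BLK_M, BLK_N, and BLK_K. In practice, each thread layout should also evenly tile its block layout so the per-thread partitioning divides cleanly:

// A different but still legal configuration (hypothetical tile and thread shapes)
auto bM = Int< 64>{};
auto bN = Int<256>{};
auto bK = Int< 16>{};

auto sA = make_layout(make_shape(bM,bK));                      // 64x16 smem block for A
auto sB = make_layout(make_shape(bN,bK));                      // 256x16 smem block for B
auto sC = make_layout(make_shape(bM,bN));                      // 64x256 accumulator block

auto tA = make_layout(make_shape(Int<16>{}, Int<16>{}));       // 256 threads copying A
auto tB = make_layout(make_shape(Int<64>{}, Int< 4>{}));       // 256 threads copying B
auto tC = make_layout(make_shape(Int< 8>{}, Int<32>{}));       // 256 threads computing C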
The example is for an NT GEMM (m-major A and n-major B), but it will also immediately function correctly for TN input data (k-major A and k-major B):
// Define shapes (dynamic)
auto M = int(m);
auto N = int(n);
auto K = int(k);
// Define strides (mixed)
auto dA = make_stride(ldA, Int<1>{}); // k-major A
auto dB = make_stride(ldB, Int<1>{}); // k-major B
auto dC = make_stride(Int<1>{}, ldC); // m-major C
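For comparison, the NT strides in the linked tutorial look roughly like this (paraphrased from sgemm_nt_1.cu; m-major A and n-major B):

// NT strides (roughly as in the tutorial)
auto dA = make_stride(Int<1>{}, ldA);   // m-major A: stride 1 along M
auto dB = make_stride(Int<1>{}, ldB);   // n-major B: stride 1 along N
auto dC = make_stride(Int<1>{}, ldC);   // m-major C: stride 1 along M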
The GEMM will function correctly, but it will not be efficient, because the smem layouts and the thread layouts used to read the data can be better aligned with the hardware. Thus, we can optimize the GEMM by using k-major smem and k-major thread layouts, which produce better partitioning patterns and more coalesced loads from gmem:
// Define the block layouts (static)
auto sA = make_layout(make_shape(bM,bK), make_stride(bK, Int<1>{})); // k-major smem
auto sB = make_layout(make_shape(bN,bK), make_stride(bK, Int<1>{})); // k-major smem
auto sC = make_layout(make_shape(bM,bN));
// Define the thread layouts (static)
auto tA = make_layout(make_shape(Int< 8>{}, Int<32>{}), LayoutRight{}); // k-major 8x32 thr layout
auto tB = make_layout(make_shape(Int< 8>{}, Int<32>{}), LayoutRight{}); // k-major 8x32 thr layout
auto tC = make_layout(make_shape(Int<16>{}, Int<16>{})); // m-major 16x16 thr layout
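To see why the k-major thread layouts help, recall how the kernel uses them: each thread layout partitions the gmem and smem tiles across threads, roughly as follows (paraphrased from the tutorial kernel; the tensor names and the k_tile index are illustrative):

// tA assigns each thread its slice of the A tiles; with a k-major tA over
// k-major data, threads with consecutive indices touch consecutive addresses.
Tensor tAgA = local_partition(gA, tA, threadIdx.x);   // per-thread slice of the gmem A tiles
Tensor tAsA = local_partition(sA, tA, threadIdx.x);   // per-thread slice of the smem A tile

copy(tAgA(_,_,k_tile), tAsA);                         // coalesced gmem -> smem copy for this k tile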
Of course, these can all be mixed and matched to suit your application.
Even more advanced layouts of input data can be used to implement general tensor-tensor contractions with this same kernel rather than simply GEMMs. You can find existing examples of these in CUTLASS already, but I'll be updating the CuTe documentation and examples with more efficient and general practices shortly.
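As a rough sketch of what that means (my own illustration, not one of the CUTLASS examples), CuTe's hierarchical shapes let a single GEMM mode stand in for several tensor modes, so the kernel above never needs to know it is doing a contraction:

// Hypothetical: fold two tensor modes (m0, m1) into the GEMM's "M" mode.
// strideA0, strideA1, strideAk, and ptr_A are illustrative names.
auto shapeA  = make_shape(make_shape(m0, m1), k);                      // ((m0,m1), K)
auto strideA = make_stride(make_stride(strideA0, strideA1), strideAk);
auto A = make_tensor(make_gmem_ptr(ptr_A), shapeA, strideA);
// To the kernel, A still behaves like an (M, K) matrix with M = m0 * m1.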
@ziyuhuang123 is your question answered?