
[QST] What are the definitions of, and the difference between, the two partition methods in CuTe?

Open ziyuhuang123 opened this issue 1 year ago • 6 comments

  TiledMMA tiled_mma;
  auto thr_mma = tiled_mma.get_slice(threadIdx.x);
  auto tAgA = thr_mma.partition_A(gA);  // (MMA, MMA_M, MMA_K, num_tile_k)
  auto tBgB = thr_mma.partition_B(gB);  // (MMA, MMA_N, MMA_K, num_tile_k)
  auto tCgC = thr_mma.partition_C(gC);  // (MMA, MMA_M, MMA_N)

  auto tArA = thr_mma.partition_fragment_A(gA(_, _, 0));  // (MMA, MMA_M, MMA_K)
  auto tBrB = thr_mma.partition_fragment_B(gB(_, _, 0));  // (MMA, MMA_N, MMA_K)
  auto tCrC = thr_mma.partition_fragment_C(gC(_, _));     // (MMA, MMA_M, MMA_N)
  // (from a different example, where tC is an explicit thread layout)
  auto tCsB = local_partition(sB, tC, threadIdx.x, Step< X,_1>{});   // (THR_N,BLK_K)
  // Partition gC (M,N) by the tile of tC
  auto tCgC = local_partition(gC, tC, threadIdx.x, Step<_1,_1>{});   // (THR_M,THR_N)

I cannot find detailed documentation for these two partition methods. Could anyone describe their usage and the difference between them? For example, I do not understand how data is allocated across threadIdx.x; if I also have threadIdx.y, how do I pass it in? And how does Step work?

ziyuhuang123 avatar Dec 06 '23 15:12 ziyuhuang123

For example, in the snippet here, what does the output shape mean?

// Tile a tensor according to the flat shape of a layout that provides the coordinate of the target index.
// This is typical at the Thread level where data is partitioned across repeated patterns of threads:
//   Tensor data = ...                                                            // (_16,_64)
//   Tensor thr_data = local_partition(data, Layout<Shape<_2,_16>>{}, thr_idx);   // ( _8, _4)

The thread layout has shape <2, 16>; does the (8, 4) here mean the data is covered by 8*4 repetitions of the <2, 16> thread pattern, so each thread ends up with an (8, 4) slice?

ziyuhuang123 avatar Dec 06 '23 16:12 ziyuhuang123

Also, when I print(A.shape) on a partitioned tensor I still get a value, but it is different from a normal tensor's shape. So why is it still a "tensor" object? Confusing!

ziyuhuang123 avatar Dec 06 '23 16:12 ziyuhuang123

It seems that Step<_1, X> means: partition the first dimension as normal, and do not partition the second dimension.

ziyuhuang123 avatar Dec 06 '23 16:12 ziyuhuang123

Also, how exactly does local_partition divide the data? Do we get bank conflicts (yes, we will), and how can we avoid them?

ziyuhuang123 avatar Dec 06 '23 16:12 ziyuhuang123

partition_A gives you a view of the subarray that this thread is responsible for. partition_fragment_A converts that view to a compact layout with the fragment element type required by the tensor cores, and creates an array (fragment) that lives in registers, into which the original elements of A, B, and C are loaded.

You typically don't need threadIdx.y.

YichengDWu avatar Dec 07 '23 06:12 YichengDWu

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Feb 02 '24 16:02 github-actions[bot]

Closing due to inactivity

mnicely avatar Feb 22 '24 15:02 mnicely