[QST] What are the definitions of, and differences between, the two partition methods in CuTe?
```cpp
TiledMMA tiled_mma;
auto thr_mma = tiled_mma.get_slice(threadIdx.x);
auto tAgA = thr_mma.partition_A(gA);                    // (MMA,MMA_M,MMA_K,num_tile_k)
auto tBgB = thr_mma.partition_B(gB);                    // (MMA,MMA_N,MMA_K,num_tile_k)
auto tCgC = thr_mma.partition_C(gC);                    // (MMA,MMA_M,MMA_N)
auto tArA = thr_mma.partition_fragment_A(gA(_, _, 0));  // (MMA,MMA_M,MMA_K)
auto tBrB = thr_mma.partition_fragment_B(gB(_, _, 0));  // (MMA,MMA_N,MMA_K)
auto tCrC = thr_mma.partition_fragment_C(gC(_, _));     // (MMA,MMA_M,MMA_N)
```

versus

```cpp
// Partition sB (N,K) by the N-mode of the thread layout tC
auto tCsB = local_partition(sB, tC, threadIdx.x, Step< X,_1>{});  // (THR_N,BLK_K)
// Partition gC (M,N) by the tile of tC
auto tCgC = local_partition(gC, tC, threadIdx.x, Step<_1,_1>{});  // (THR_M,THR_N)
```
I cannot find detailed documentation for these two partition methods. Could anyone describe how each is used and how they differ? In particular, I do not understand how the data is distributed across threadIdx.x; if my block also has a threadIdx.y, how do I pass it in? And how does Step work?
Take the example below: what does the output shape mean?
```cpp
// Tile a tensor according to the flat shape of a layout that provides the coordinate of the target index.
// This is typical at the Thread level where data is partitioned across repeated patterns of threads:
// Tensor data = ...                                                            // (_16,_64)
// Tensor thr_data = local_partition(data, Layout<Shape<_2,_16>>{}, thr_idx);   // ( _8, _4)
```
The thread layout here is Shape<_2,_16>, and the result is (_8,_4). Does that mean each thread gets an (_8,_4) sub-tensor, i.e. the (_16,_64) data is covered by 8*4 repetitions of the <_2,_16> thread pattern?
Also, if I call print(A.shape()) on the partitioned result I still get a value, but it no longer matches how the tensor was originally defined. So why is the result still a "tensor" object? I find this confusing.
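For concreteness, here is a minimal host-side sketch that reproduces the shapes from the quoted example (the buffer and the fixed thread index are stand-ins; CuTe's functions are host/device, so this compiles off-GPU):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  float buf[16 * 64];
  // Static (_16,_64) tensor over an ordinary array.
  Tensor data = make_tensor(&buf[0], make_shape(Int<16>{}, Int<64>{}));

  // 2*16 = 32 threads tile the data; thread 0 stands in for threadIdx.x.
  int thr_idx = 0;
  Tensor thr_data = local_partition(data, Layout<Shape<_2,_16>>{}, thr_idx);

  // Each thread owns every 2nd row and every 16th column:
  // (_16/_2, _64/_16) = (_8,_4), i.e. the 2x16 thread pattern repeats 8*4 times.
  print(thr_data.shape());  // (_8,_4)
  return 0;
}
```

So yes: the (_8,_4) is the per-thread sub-tensor, and the result is still a genuine Tensor, namely a strided view into the same underlying data, which is why print(thr_data.shape()) still works on it.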
It seems that Step<_1, X> means: divide the first mode as normal, and leave the second mode undivided.
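That reading matches the sgemm tutorial. Below is a sketch under assumed sizes (128x8 shared-memory tiles and a 16x16 thread layout; all names are placeholders):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  float A[128 * 8], B[128 * 8], C[128 * 128];
  Tensor sA = make_tensor(&A[0], make_shape(Int<128>{}, Int<8>{}));    // (BLK_M,BLK_K)
  Tensor sB = make_tensor(&B[0], make_shape(Int<128>{}, Int<8>{}));    // (BLK_N,BLK_K)
  Tensor gC = make_tensor(&C[0], make_shape(Int<128>{}, Int<128>{}));  // (BLK_M,BLK_N)

  auto tC = Layout<Shape<_16,_16>>{};  // 256 threads arranged 16x16
  int thr_idx = 0;                     // stand-in for threadIdx.x

  // Step<_1, X>: only the M-mode of tC participates -> M is split 16 ways, K is untouched.
  Tensor tCsA = local_partition(sA, tC, thr_idx, Step<_1, X>{});  // (_8,_8) = (THR_M,BLK_K)
  // Step< X,_1>: only the N-mode of tC participates -> N is split 16 ways, K is untouched.
  Tensor tCsB = local_partition(sB, tC, thr_idx, Step< X,_1>{});  // (_8,_8) = (THR_N,BLK_K)
  // Step<_1,_1>: both modes participate -> each thread owns an (_8,_8) piece of C.
  Tensor tCgC = local_partition(gC, tC, thr_idx, Step<_1,_1>{});  // (_8,_8) = (THR_M,THR_N)

  print(tCsA.shape()); print(tCsB.shape()); print(tCgC.shape());
  return 0;
}
```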
Also, how exactly does local_partition divide the data? Will we get shared-memory bank conflicts (I believe we will), and how can we avoid them?
partition_A gives you a view of the subarray that this thread is responsible for. partition_fragment_A converts that view to a compact layout with the fragment element type required by the tensor cores, and then creates an array (a fragment) that lives in registers, into which the original elements from A, B, and C are loaded. You typically don't need threadIdx.y.
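To make the contrast concrete, here is a hedged mainloop sketch: TiledMMA_ and the gA/gB/gC tiles are assumed to be set up as in the question, and this mirrors the CuTe tutorial pattern rather than any one shipped kernel.

```cpp
#include <cute/tensor.hpp>
using namespace cute;

template <class TiledMMA_, class TA, class TB, class TC>
__device__ void mma_mainloop(TiledMMA_ tiled_mma, TA gA, TB gB, TC gC) {
  auto thr_mma = tiled_mma.get_slice(threadIdx.x);

  // partition_*: views into existing memory -- no data is moved.
  Tensor tAgA = thr_mma.partition_A(gA);                  // (MMA,MMA_M,MMA_K,num_tile_k)
  Tensor tBgB = thr_mma.partition_B(gB);                  // (MMA,MMA_N,MMA_K,num_tile_k)
  Tensor tCgC = thr_mma.partition_C(gC);                  // (MMA,MMA_M,MMA_N)

  // partition_fragment_*: new register arrays with the same per-thread shape
  // and the value type the MMA instruction expects.
  Tensor tArA = thr_mma.partition_fragment_A(gA(_,_,0));  // (MMA,MMA_M,MMA_K)
  Tensor tBrB = thr_mma.partition_fragment_B(gB(_,_,0));  // (MMA,MMA_N,MMA_K)
  Tensor tCrC = thr_mma.partition_fragment_C(gC);         // (MMA,MMA_M,MMA_N)
  clear(tCrC);                                            // zero the accumulators

  for (int k = 0; k < size<3>(tAgA); ++k) {
    copy(tAgA(_,_,_,k), tArA);          // load this thread's slice into registers
    copy(tBgB(_,_,_,k), tBrB);
    gemm(tiled_mma, tArA, tBrB, tCrC);  // accumulate in registers
  }
  copy(tCrC, tCgC);                     // write the accumulators back out
}
```

On threadIdx.y: all of these partitioners take one linear thread index, so if you do launch a 2-D block you would flatten it yourself first (e.g. threadIdx.x + blockDim.x * threadIdx.y); more commonly you launch a 1-D block and let the thread layout supply the 2-D arrangement.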