[BUG] make_tiled_copy should not assume 2d data, Thr and Val layouts
Describe the bug
make_tiled_copy should also not secretly pad Thr and Val. See the code sample and discussion below.
Steps/Code to reproduce bug
#include <cute/tensor.hpp>

#include <vector>

using namespace cute;

int main() {
  std::vector<int> tensor_buffer(12);
  for (int i = 0; i < 12; i++) {
    tensor_buffer[i] = i;
  }
  {
    print("====================================== case 1 ======================================\n");
    auto t = make_tensor(tensor_buffer.data(), make_layout(make_shape(_12{})));
    Layout<Shape<_2>, Stride<_1>> thr;
    Layout<Shape<_2>, Stride<_1>> val;
    auto tiled_copy = make_tiled_copy(Copy_Atom<DefaultCopy, float>{}, thr, val);
    auto thr_copy = tiled_copy.get_slice(1);
    // auto view = thr_copy.partition_S(t); // static_assert(R >= rank_v<TiledShape_MN>, "Rank of tensor to be partitioned too small.");
    // print_tensor(view);
  }
  {
    print("====================================== case 2 ======================================\n");
    auto t = make_tensor(tensor_buffer.data(), make_layout(make_shape(_1{}, _12{})));
    Layout<Shape<_1, _2>, Stride<_2, _1>> thr;
    Layout<Shape<_1, _2>, Stride<_2, _1>> val;
    auto tiled_copy = make_tiled_copy(Copy_Atom<DefaultCopy, float>{}, thr, val);
    auto thr_copy = tiled_copy.get_slice(1);
    auto view = thr_copy.partition_S(t);
    print_tensor(t);
    // ptr[32b](0x55adbfc83eb0) o (_1,_12):(_0,_1):
    //   0 1 2 3 4 5 6 7 8 9 10 11
    print_tensor(view);
    // ptr[32b](0x55adbfc83eb8) o ((_1,_2),_1,_3):((_0,_1),_0,_4):
    // ptr[32b](0x55adbfc83eb8) o ((_1,_2),_1):((_0,_1),_0):
    //   2
    //   3
    // -----
    // ptr[32b](0x55adbfc83ec8) o ((_1,_2),_1):((_0,_1),_0):
    //   6
    //   7
    // -----
    // ptr[32b](0x55adbfc83ed8) o ((_1,_2),_1):((_0,_1),_0):
    //   10
    //   11
  }
  {
    print("====================================== case 3 ======================================\n");
    auto t = make_tensor(tensor_buffer.data(), make_layout(make_shape(_12{}, _1{})));
    Layout<Shape<_2>, Stride<_1>> thr;
    Layout<Shape<_2>, Stride<_1>> val;
    auto tiled_copy = make_tiled_copy(Copy_Atom<DefaultCopy, int>{}, thr, val);
    auto thr_copy = tiled_copy.get_slice(1);
    auto view = thr_copy.partition_S(t);
    print(append<2>(thr, Layout<_1>{}));
    // (_2,_1):(_1,_0)
    print_tensor(t);
    // ptr[32b](0x556b715e0eb0) o (_12,_1):(_1,_0):
    //   0
    //   1
    //   2
    //   3
    //   4
    //   5
    //   6
    //   7
    //   8
    //   9
    //   10
    //   11
    print_tensor(view);
    // ptr[32b](0x556b715e0eb0) o ((_1,_2),_6,_1):((_0,_1),_2,_0):
    // ptr[32b](0x556b715e0eb0) o ((_1,_2),_6):((_0,_1),_2):
    //   0 2 4 6 8 10
    //   1 3 5 7 9 11
  }
  {
    print("====================================== case 4 ======================================\n");
    auto t = make_tensor(tensor_buffer.data(), make_layout(make_shape(_12{}, _1{})));
    Layout<Shape<_2, _1>, Stride<_1, _2>> thr;
    Layout<Shape<_2, _1>, Stride<_1, _2>> val;
    auto tiled_copy = make_tiled_copy(Copy_Atom<DefaultCopy, float>{}, thr, val);
    auto thr_copy = tiled_copy.get_slice(1);
    auto view = thr_copy.partition_S(t);
    print_tensor(t);
    // same as case 3
    print_tensor(view);
    // ptr[32b](0x55b4e4864eb8) o ((_1,_2),_3,_1):((_0,_1),_4,_0):
    // ptr[32b](0x55b4e4864eb8) o ((_1,_2),_3):((_0,_1),_4):
    //   2 6 10
    //   3 7 11
  }
  return 0;
}
Expected behavior
Note that the tensor is 1-D (for whatever reason), and threads access the data in an interleaved style, over multiple iterations.
case 1
Case 1 is a perfect use case of "compose then slice": a 1-D Tensor, 1-D Thr, and 1-D Val, all along the same direction. The hidden append in the implementation prevents make_tiled_copy from being useful in this case; a small demonstration follows the snippet below.
auto thr_layout_mn = append<R>(thr_layout, Layout<_1>{});
auto val_layout_mn = append<R>(val_layout, Layout<_1>{});
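For concreteness, a minimal sketch (my own) of what that hidden padding does to the rank-1 Thr layout from case 1; per case 3, the appended result prints as (_2,_1):(_1,_0):
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  Layout<Shape<_2>, Stride<_1>> thr; // the rank-1 Thr layout from case 1
  // The hidden append pads it with a size-1 mode, making it rank-2;
  // case 3 shows this prints as (_2,_1):(_1,_0).
  print(append<2>(thr, Layout<_1>{})); print("\n");
  return 0;
}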
case 2
Case 2 is what we want, and it works as expected. It works because we make everything explicit and correctly express the ordering.
case 3
Case 3 is extremely weird and confusing due to the append<R>: it produces (_2,_1):(_1,_0) from _2:_1, which yields an incomprehensible iteration mode. This should not happen...
case 4
Case 4 is what case 3 should have been; it is a manually fixed version of case 3.
Environment details: CUTLASS commit a75b4ac483166189a45290783cb0a18af5ff0ea5
@ccecka
In summary, the secret behaviour limits the usage of make_tiled_copy; in some edge cases it produces incomprehensible and unexpected "incorrect" results.
The problem appears to be with blocked_product and raked_product rather than the append.
raked_product(Layout<Shape<_2>>{}, Layout<Shape<_2>>{})
and
raked_product(Layout<_2>{}, Layout<_2>{})
should be explicitly rank-1 rather than rank-2 (and implicitly intended to be interpreted as 1-D). Both blocked_product and raked_product are rather unique because they are rank-sensitive, and the rank of the result should be equal to the maximum of the ranks of the inputs. That would fix the confusing cases.
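A minimal sketch (my own) to check the ranks in question; what it prints depends on the CUTLASS version, since this postcondition was tightened by the fix mentioned at the end of this thread:
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  // Per the postcondition above, both results should be rank-1;
  // before the fix they came out padded to rank-2.
  print(raked_product(Layout<Shape<_2>>{}, Layout<Shape<_2>>{})); print("\n");
  print(raked_product(Layout<_2>{}, Layout<_2>{})); print("\n");
  print(rank(raked_product(Layout<_2>{}, Layout<_2>{}))); print("\n");
  return 0;
}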
I'll include this in my documentation+pre/postconditions update coming shortly.
You can use
print(tiled_copy);
to get more information and potentially use make_tiled_copy_impl to specify the Tiler and TV_Layout directly.
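For instance, a hedged sketch (my own construction, assuming make_tiled_copy_impl takes the atom, the TV layout, and the tiler, in that order) of building the case-1 TiledCopy directly:
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  // 2 threads x 2 values, raked across a 4-element tile.
  auto thr = Layout<Shape<_2>, Stride<_1>>{};
  auto val = Layout<Shape<_2>, Stride<_1>>{};
  auto layout_mn = raked_product(thr, val);  // (m) -> (thr_idx,val_idx)
  auto layout_tv = right_inverse(layout_mn)
                       .with_shape(make_shape(size(thr), size(val)));
  // Specify the Tiler and TV_Layout directly, bypassing the padding
  // performed by make_tiled_copy.
  auto tiled_copy = make_tiled_copy_impl(Copy_Atom<DefaultCopy, float>{},
                                         layout_tv,
                                         make_shape(size(layout_mn)));
  print(tiled_copy);
  return 0;
}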
I happened to create my own utility function:
#include "cute/tensor.hpp"
using namespace cute;
template <class ThrLayout, class ValLayout, class Tensor>
auto make_tv_view(Tensor&& tensor, const ThrLayout& thr_layout = {}, const ValLayout& val_layout = {}) {
auto layout = raked_product(thr_layout, val_layout);
auto with_iter = zipped_divide(tensor.layout(), layout);
auto layout_per_iter = get<0>(with_iter);
auto iter_layout = get<1>(with_iter);
auto layout_tv_iter = make_layout(
right_inverse(layout_per_iter).with_shape(make_shape(size(thr_layout), size(val_layout))),
iter_layout
);
return std::forward<Tensor>(tensor).compose(layout_tv_iter);
}
int main() {
std::vector<int> tensor_buffer(12);
for (int i = 0; i < 12; i++) {
tensor_buffer[i] = i;
}
{
print("====================================== case 5 ======================================\n");
// auto t = make_tensor(tensor_buffer.data(), make_layout(make_shape(_4{}, _3{})));
auto t = make_tensor(tensor_buffer.data(), make_layout(make_shape(_12{})));
Layout<Shape<_2>, Stride<_1>> thr;
Layout<Shape<_2>, Stride<_1>> val;
print_tensor(make_tv_view(t, thr, val)(make_coord(0, _), _));
}
}
CuTe should somehow provide a utility function like this.
If it works for you, that's great. I'll take a closer look soon.
Until then, it appears it can be simplified to
template <class ThrLayout, class ValLayout, class Tensor>
auto make_tv_view(Tensor&& tensor,
                  ThrLayout const& thr_layout = {},  // (m,n,...) -> thr_idx
                  ValLayout const& val_layout = {})  // (m,n,...) -> val_idx
{
  auto layout_mn = raked_product(thr_layout, val_layout);
  auto layout_tv = right_inverse(layout_mn).with_shape(make_shape(size(thr_layout), size(val_layout)));
  return zipped_divide(tensor, layout_mn).compose(layout_tv, _);
}
Great to know that I can use underscore in compose.
EDIT: removed unnecessary compose
@ccecka The .compose(layout_tv, _) does not work for the 3-D case. More work for you now =).
#include <cute/tensor.hpp>

#include <utility>

using namespace cute;

template <typename ThrLayout, typename ValLayout, typename Layout>
__host__ __device__ constexpr auto make_tv_layout(
    const Layout& target,              // the target layout we want to compose with
    const ThrLayout& thr_layout = {},  // (t0,...) -> thr_idx
    const ValLayout& val_layout = {}   // (v0,...) -> val_idx
) {
  auto layout = raked_product(thr_layout, val_layout);
  auto with_iter = zipped_divide(target, layout);
  auto layout_per_iter = get<0>(with_iter);
  auto iter_layout = get<1>(with_iter);
  auto layout_tv_iter = make_layout(
      right_inverse(layout_per_iter).with_shape(make_shape(size(thr_layout), size(val_layout))),
      iter_layout);
  return layout_tv_iter;
}

template <typename ThrLayout, typename ValLayout, typename Layout>
__host__ __device__ constexpr auto make_tv_layout_ccecka(
    const Layout& target,              // the target layout we want to compose with
    const ThrLayout& thr_layout = {},  // (t0,...) -> thr_idx
    const ValLayout& val_layout = {}   // (v0,...) -> val_idx
) {
  auto layout = raked_product(thr_layout, val_layout);  // (t0*v0,t1*v1,...) -> (thr_idx,val_idx)
  auto layout_tv = right_inverse(layout).with_shape(make_shape(size(thr_layout), size(val_layout)));
  return zipped_divide(target, layout).compose(layout_tv, _);
}

int main() {
  constexpr int XSize = 16;
  constexpr int YSize = 64;
  int zsize = 32;  // each warp processes a separate z idx; zsize is a runtime value

  // The plan
  constexpr int XVec = 4;
  constexpr int YVec = 8;
  constexpr int NumXVec = 4;
  constexpr int NumYVec = 8;

  // make_tile causes a per-mode product
  auto layout = make_layout(make_shape(make_shape(make_shape(Int<XSize>{}, Int<YSize>{})), zsize));
  auto tcoord = make_identity_tensor(layout.shape());

  auto [layout_tvi_mine, layout_tvi_ccecka] = [&]() {
    constexpr int NumThreads = 128;
    constexpr int WarpSize = 32;
    static_assert(NumXVec * NumYVec == WarpSize);
    constexpr int NumWarps = NumThreads / WarpSize;
    constexpr auto thr_layout = make_layout(make_shape(make_shape(Int<NumXVec>{}, Int<NumYVec>{}), Int<NumWarps>{}));
    constexpr auto val_layout = make_layout(make_shape(make_shape(Int<XVec>{}, Int<YVec>{}), _1{}));
    // X  Y  Z
    // 16 64 32  target tensor
    // 4  8  4   thr_layout
    // 4  8  1   val_layout
    // 16 64 4   raked_product(thr_layout, val_layout)
    // 1  1  8   iter
    print("thr_layout:"), print(thr_layout), print("\n");
    print("val_layout:"), print(val_layout), print("\n");
    return std::make_pair(make_tv_layout(layout, thr_layout, val_layout),
                          make_tv_layout_ccecka(layout, thr_layout, val_layout));
  }();

  // Let's see the mapped coords of thread 0
  auto t0_mine = tcoord.compose(layout_tvi_mine)(make_coord(0, _), _);
  auto t0_ccecka = tcoord.compose(layout_tvi_ccecka)(make_coord(0, _), _);
  print("size (mine): "), print(size(t0_mine)), print("size (ccecka): "), print(size(t0_ccecka)), print("\n");
  for (int i = 0; i < size(t0_mine); i++) {
    print(i), print("\tcoord (mine): "), print(t0_mine(i)), print("\tcoord (ccecka): "), print(t0_ccecka(i)), print("\n");
  }
  // size (mine): 256size (ccecka): 256
  // 0   coord (mine): (((0,0)),0)    coord (ccecka): (((0,0)),0)
  // 1   coord (mine): (((1,0)),0)    coord (ccecka): (((0,8)),0)
  // 2   coord (mine): (((2,0)),0)    coord (ccecka): (((0,16)),0)
  // 3   coord (mine): (((3,0)),0)    coord (ccecka): (((0,24)),0)
  // 4   coord (mine): (((4,0)),0)    coord (ccecka): (((0,32)),0)
  // 5   coord (mine): (((5,0)),0)    coord (ccecka): (((0,40)),0)
  // 6   coord (mine): (((6,0)),0)    coord (ccecka): (((0,48)),0)
  // 7   coord (mine): (((7,0)),0)    coord (ccecka): (((0,56)),0)
  // 8   coord (mine): (((8,0)),0)    coord (ccecka): (((0,0)),1)
  // 9   coord (mine): (((9,0)),0)    coord (ccecka): (((0,8)),1)
  // 10  coord (mine): (((10,0)),0)   coord (ccecka): (((0,16)),1)
  // 11  coord (mine): (((11,0)),0)   coord (ccecka): (((0,24)),1)
  // 12  coord (mine): (((12,0)),0)   coord (ccecka): (((0,32)),1)
  // 13  coord (mine): (((13,0)),0)   coord (ccecka): (((0,40)),1)
  // 14  coord (mine): (((14,0)),0)   coord (ccecka): (((0,48)),1)
  // 15  coord (mine): (((15,0)),0)   coord (ccecka): (((0,56)),1)
  return 0;
}
I found this because I applied the simplification, and it works pretty well for the 1-D and 2-D cases. For 3-D, my sliced tensor is pretty weird, and I scratched my head for a whole day before I finally decided to try the old version of the TV layout maker.
I don't know what you mean by "works". Neither approach appears to be giving you what you actually intend. This is why I don't recommend attempting to reinvent partitioning.
The difference between the two is that "yours" is composing with
auto layout_tv_iter = make_layout(
    right_inverse(layout_per_iter).with_shape(make_shape(size(thr_layout), size(val_layout))),
    iter_layout);
which is a Layout, and "mine" is composing with
make_tile(layout_tv, _)
which is a Tile and applied by-mode.
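To make that distinction concrete, a toy sketch (mine, with made-up sizes) composing the same layout with a Tile versus with a plain Layout:
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  auto A = make_layout(make_shape(_4{}, _8{})); // (_4,_8):(_1,_4)
  // Tile: applied by-mode. Layout<_2,_2> transforms mode 0 only; _ keeps mode 1.
  auto by_tile = A.compose(Layout<_2, _2>{}, _);
  // Layout: applied to A's flattened 1-D domain as a whole.
  auto by_layout = A.compose(Layout<_2, _2>{});
  print(by_tile); print("\n");   // mode 0 is strided, mode 1 untouched
  print(by_layout); print("\n"); // a rank-1 selection of 2 elements
  return 0;
}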
The other problems that are going to prevent this from generalizing are
- raked_product is rank-sensitive and not applicable hierarchically. raked_product( ((X,Y),Z), ((A,B),C) ) computes the "product" of (X,Y) and (A,B), not of X and A and then Y and B.
- By applying zipped_divide to the raked_product, you're asking for the composition of incompatible objects. The LHS is a tensor with logical coordinates and the RHS is a layout mapping something to (thr_idx,val_idx). The (thr_idx,val_idx) are not intended to be logical coordinates into the tensor, so the composition doesn't make sense.
- The iter_layout being composed with the original tensor doesn't make much sense either.
For these kinds of partitions, you have to ask yourself what you actually care about
- Do you care about the value assignment? It appears you do, but only for vectorization reasons.
- Do you actually care about the thread assignment? It appears you do not.
- Are these partitioning patterns specified by an MMA_Op or a Copy_Op?
- Are your tensors static or dynamic or both?
- Do you need to apply the same partitioning pattern to multiple tensors?
The pattern that CuTe uses for partitioning is a Tiler ∘ LayoutTV operator, which you've seen in TiledMMA and TiledCopy, and you can find a more naked version in partition.hpp.
#include <cute/tensor.hpp>

using namespace cute;

template <class Layout, class Tiler, class LayoutTV>
__host__ __device__ constexpr auto
tv_tiling(Layout const& target,       // the target layout we want to compose with
          Tiler const& tiler,         // the tiler to extract out relevant elements
          LayoutTV const& layout_tv)  // the TV transform to apply to the tile of data
{
  return zipped_divide(target, tiler).compose(layout_tv, _);
}

int main() {
  auto layout = make_layout(make_shape(make_shape(make_shape(Int<16>{}, Int<64>{})), 32));
  auto tcoord = make_identity_tensor(layout.shape());

  // Select out of the target the tile of data we want to work on
  auto tiler = Shape<Shape<Shape<_16,_64>>, _4>{};  // (M,N,L) -> tensor coords

  // Transform the tile of data to a TV format
  auto thr_layout = Layout<Shape<_4,_8,_4>>{};      // (m,n,l) -> thr_idx
  auto val_layout = Layout<Shape<_4,_8,_1>>{};      // (m,n,l) -> val_idx
  // (M,N,L) -> (thr_idx,val_idx)
  auto thrval_layout = raked_product(thr_layout, val_layout);
  // (thr_idx,val_idx) -> (M,N,L)
  auto layout_tv = right_inverse(thrval_layout).with_shape(size(thr_layout), size(val_layout));
  // We've constructed the layout_tv to index into the coords of the tiler
  CUTE_STATIC_ASSERT_V(cosize(layout_tv) == size(tiler));

  auto result = tv_tiling(tcoord, tiler, layout_tv);  // ((thr,val), iter)

  // Let's see the mapped coords of thread 0
  auto t0_result = result(make_coord(0, _), _);
  for (int i = 0; i < size(t0_result); ++i) {
    print(i), print("\t"), print(t0_result(i)), print("\n");
  }
  // 0   (((0,0)),0)   // Each thread gets a (((4,8)),1) "block" of data 8 times in the target
  // 1   (((1,0)),0)
  // 2   (((2,0)),0)
  // 3   (((3,0)),0)
  // 4   (((0,1)),0)
  // 5   (((1,1)),0)
  // 6   (((2,1)),0)
  // 7   (((3,1)),0)
  // 8   (((0,2)),0)
  // 9   (((1,2)),0)
  // 10  (((2,2)),0)
  // 11  (((3,2)),0)
  // 12  (((0,3)),0)
  // 13  (((1,3)),0)
  // 14  (((2,3)),0)
  // 15  (((3,3)),0)
  // ...
}
This is exactly the interface of make_tiled_copy_impl that I previously recommended. My upcoming documentation update includes sections on Tilers and products.
I'm starting to see some blind spots that I previously ignored.
The most important one is that in zipped_divide(target, tiler).compose(layout_tv, _), I always wanted to derive the tiler from thr_layout and val_layout. This is why I always relied on raked_product instead of being explicit about the tiler.
derived_tiler = raked_product(thr_layout, val_layout); // this is the confusing part; it should be explicit instead
thrval_layout = raked_product(thr_layout, val_layout); // forward mapping
layout_tv = right_inverse(thrval_layout).with_shape(size(thr_layout), size(val_layout)); // reverse mapping
In my previous example, each warp processes a whole XY slice and iterates over 8 Z indices. By relying on raked_product, the only thing I could achieve was each warp iterating over 8 consecutive Z indices, and I was wondering how to achieve iterating over 8 strided Z indices. I know how to do it now: throw away the derived tiler and design the tiler explicitly as well.
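To sanity-check that idea, a simplified, fully static sketch (my own; the fused 64-element XY mode and the stride of 8 are illustrative stand-ins, not values taken from the example above):
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  // Stand-in target: a fused XY mode of size 64 and a Z mode of size 32.
  auto tcoord = make_identity_tensor(make_shape(Int<64>{}, Int<32>{}));
  // Derived-style tiler: each tile holds 4 consecutive Z indices, so one
  // warp's Z index advances by 4 per iteration.
  auto tile_consec = make_tile(Int<64>{}, Layout<_4, _1>{});
  // Explicit layout tiler: each tile holds 4 Z indices strided by 8, so
  // one warp's Z index advances by 1 per iteration instead.
  auto tile_strided = make_tile(Int<64>{}, Layout<_4, _8>{});
  auto a = zipped_divide(tcoord, tile_consec);  // ((tile),(iter))
  auto b = zipped_divide(tcoord, tile_strided);
  // Same tile slot, second iteration: compare which Z coordinate each maps to.
  print(a(make_coord(0, 1), 1)); print("\n");
  print(b(make_coord(0, 1), 1)); print("\n");
  return 0;
}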
I also see some blind spots of yours now.
By applying zipped_divide to the raked_product, you're asking for the composition of incompatible objects.
I wouldn't call it incompatible: by my thr_layout and val_layout design, the result of raked_product should cover the whole XY slice, so zipped_divide should only produce an iteration mode over the Z indices, (1,1,8), because that mode is not fully covered. But if you imagine XY growing from 16x64 to something like 17x64, then the iteration mode grows from (1,1,8) to (2,1,8). This totally makes sense because of the over-approximating nature.
The iter_layout being composed with the original tensor doesn't make much sense either.
Sorry for the confusion; this is some uncleaned code left over from copying. The target.compose(layout_tv_iter) is a no-op if you read it as a whole, as tensor.compose(tensor_layout.compose(layout_tv_iter)); it should be cleaned up to tensor.compose(layout_tv_iter). It just happens not to cause errors.
For the rank-sensitivity of raked_product, I'd imagine a utility that does record hierarchy -> flatten -> product -> unflatten with the recorded hierarchy, but it may not be useful...
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
Closing due to inactivity
@mnicely Sorry, but I think this is still relevant. Inactivity does not imply it is solved.
The bugs I pointed out in https://github.com/NVIDIA/cutlass/issues/1284#issuecomment-1870749414 were fixed with the 3.4 docs update. Those fixes strengthen and improve the post-conditions of rank-sensitive raked_product and blocked_product.
The rest of the concepts in this thread remain intact, but the functionality of the original code should be more consistent with expectations.
I manually verified with the latest main @ bbe579a9e3beb6ea6626d9227ec32d0dae119a49. Case 1 works now (i.e., 1-D data, Thr, and Val). Case 3 works as expected now (i.e., 2-D data with 1-D Thr and Val, with automatic padding). I'd assume higher-rank cases will work as expected as well.
@ccecka I had been thinking that this was caused by a design that intrinsically supported only 2-D, but it turned out to be a bug. Sorry for the misunderstanding.