[QST] Don't know how to use predicate tensor.
I encountered some problems when using predicate tensor.
In the tutorials:
https://github.com/NVIDIA/cutlass/blob/main/examples/cute/tutorial/tiled_copy.cu
https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/0y_predication.md
there are examples of how to use tiled copy and predicate tensors, but I ran into several issues when trying them out.
First, I’m unclear on how to combine vectorized copy with predication. I might need some examples. In the context of vectorized memory access, I used the following memory access statement:
using AccessType = cutlass::AlignedArray<fp16_t, 8>;
using Atom = Copy_Atom<UniversalCopy<AccessType>, fp16_t>;
auto tiled_copy = make_tiled_copy(
    Atom{}, ThreadLayout{}, VecLayout{}
);
Where ThreadLayout = (256):(1) and VecLayout = (8):(1).
I expect each of the 256 threads to copy 8 elements from memory. Suppose there are only 16 elements to copy, so I want only the first two threads to execute the copy instruction. However, I’m unsure how to set up the predicate tensor. Should it have shape (8):(1), with one predicate per element, or (1):(1), with one predicate for the entire vector? Neither version produces an error.
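To make the question concrete, here is a minimal host-side sketch of the "one predicate per vector" option. It is plain C++, not CuTe: `vector_in_bounds` and `active_threads` are hypothetical helpers, and the sketch assumes thread `tid` owns the contiguous elements starting at `tid * 8`. It only models which threads would issue their 8-element copy. It does not claim which predicate shape CuTe expects.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a CuTe function): one predicate per vector.
// Thread `tid` is assumed to own elements
// [tid * vec_width, tid * vec_width + vec_width).
// The predicate is true only if the whole vector is in bounds, since a
// single vector load instruction is either issued or skipped as a unit.
bool vector_in_bounds(std::size_t tid, std::size_t vec_width, std::size_t num_elems) {
    assert(vec_width > 0);
    return tid * vec_width + vec_width <= num_elems;
}

// Counts how many of `num_threads` threads would issue their vector copy.
std::size_t active_threads(std::size_t num_threads, std::size_t vec_width,
                           std::size_t num_elems) {
    std::size_t count = 0;
    for (std::size_t tid = 0; tid < num_threads; ++tid) {
        if (vector_in_bounds(tid, vec_width, num_elems)) {
            ++count;
        }
    }
    return count;
}
```

With 256 threads, a vector width of 8, and 16 elements, this model enables threads 0 and 1 only, which is the behavior described above.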
Secondly, tiled copy automatically loops over the data. For example, when tAgA = (4088):(1) and I call:
copy(tiled_copy, tAgA, ...)
Since my tiled copy can only read 8 * 256 = 2048 elements at a time, copying tAgA takes 2 iterations. In the last iteration only 2040 elements remain, so the last thread needs to be masked out by the predicate tensor. I’m unsure how to handle this. What should the predicate tensor look like in this case?
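The looping case can be modeled the same way, with one predicate per thread per iteration. The sketch below is plain C++ rather than CuTe; `num_iterations` and `thread_predicates` are hypothetical names, and it assumes that in iteration `iter` thread `tid` owns the 8 elements starting at `iter * 2048 + tid * 8`. It follows the idea of comparing each vector's coordinate against the tensor extent, but it is not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of passes needed when each pass moves
// num_threads * vec_width elements (ceiling division).
std::size_t num_iterations(std::size_t num_elems, std::size_t num_threads,
                           std::size_t vec_width) {
    const std::size_t per_iter = num_threads * vec_width;
    assert(per_iter > 0);
    return (num_elems + per_iter - 1) / per_iter;
}

// Hypothetical helper: the predicates seen by one thread, one entry per
// iteration. Entry `iter` is true when the thread's whole vector in that
// iteration lies inside [0, num_elems).
std::vector<bool> thread_predicates(std::size_t tid, std::size_t num_elems,
                                    std::size_t num_threads, std::size_t vec_width) {
    const std::size_t per_iter = num_threads * vec_width;
    const std::size_t iters = num_iterations(num_elems, num_threads, vec_width);
    std::vector<bool> pred(iters);
    for (std::size_t iter = 0; iter < iters; ++iter) {
        const std::size_t first = iter * per_iter + tid * vec_width;
        pred[iter] = first + vec_width <= num_elems;
    }
    return pred;
}
```

For 4088 elements, 256 threads, and 8 elements per vector, this model gives ceil(4088 / 2048) = 2 iterations. Thread 254 is enabled in both iterations, and thread 255 is enabled in the first and masked out in the second.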
Finally, I ran into more difficulties when using a predicate tensor with an MMA tiled copy. These operations eventually call the ldmatrix instruction to read the data, and ldmatrix has its own data layout when loading, so I’m not sure how to mask out the out-of-bounds (OOB) data. Should I study the ldmatrix layout carefully and build a suitable predicate tensor by hand? Could you provide a simple GEMM example with OOB data to help me learn how to use the predicate tensor properly?
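As a reference point for the OOB GEMM case, the sketch below models a predicated tile load in plain C++. It is an assumption-laden illustration, not CuTe code, and it says nothing about the ldmatrix layout: `load_tile_zero_fill` is a hypothetical helper that applies a per-element bounds predicate while staging a tile from a row-major matrix and leaves zeros where the coordinate is out of bounds, so whatever reads the staged tile afterwards sees a full tile.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not CuTe, not ldmatrix): stage a tile_m x tile_k
// tile that starts at (row0, col0) of a row-major m x k matrix `a`.
// Out-of-bounds positions keep the value 0, which contributes nothing
// to a later multiply-accumulate.
std::vector<float> load_tile_zero_fill(const std::vector<float>& a,
                                       std::size_t m, std::size_t k,
                                       std::size_t row0, std::size_t col0,
                                       std::size_t tile_m, std::size_t tile_k) {
    assert(a.size() == m * k);
    std::vector<float> tile(tile_m * tile_k, 0.0f);
    for (std::size_t i = 0; i < tile_m; ++i) {
        for (std::size_t j = 0; j < tile_k; ++j) {
            const std::size_t r = row0 + i;
            const std::size_t c = col0 + j;
            const bool pred = (r < m) && (c < k);  // per-element predicate
            if (pred) {
                tile[i * tile_k + j] = a[r * k + c];
            }
        }
    }
    return tile;
}
```

For a 3 x 3 matrix holding 1 through 9 in row-major order, a 2 x 2 tile starting at (2, 2) has only its first element in bounds, so the staged tile is {9, 0, 0, 0}.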