ziyuhuang123
No, I mean.... can it be used directly on Windows?
oh, thanks!
Like the example here, what does the output mean?
```
// Tile a tensor according to the flat shape of a layout that provides the coordinate of the target index....
```
So when I print `A.shape`, I still get a value, but it is different from the shape in the original tensor definition. So why is it still a "tensor" object here??? Confusing!
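To make my confusion concrete, here is a tiny host-side sketch of what I am doing (the 16x64 size, the buffer, and the 2x16 thread layout are just my own example, not taken from the quoted doc):

```cpp
#include <cstdio>
#include <vector>
#include <cute/tensor.hpp>
using namespace cute;

int main()
{
  std::vector<float> buf(16 * 64);

  // A CuTe "Tensor" is just an iterator/pointer plus a Layout; it does not own the data.
  Tensor A = make_tensor(buf.data(), make_layout(make_shape(Int<16>{}, Int<64>{})));

  print(A.shape());   // (_16,_64)
  printf("\n");
  print(A.layout());  // (_16,_64):(_1,_16)  -- column-major by default
  printf("\n");

  // local_partition returns another Tensor: same data pointer, new (sliced) layout.
  Tensor thrA = local_partition(A, Layout<Shape<_2, _16>>{}, 0);
  print(thrA.shape());  // (_8,_4)
  printf("\n");
  return 0;
}
```

So if I understand correctly, `local_partition` still gives back a Tensor (same pointer, different layout), which is probably why it still prints a shape, just not the one I declared.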
It seems that `Step<_1, X>` means: for the first dimension, divide as normal; for the second dimension, do not divide (that mode of the tiler is ignored).
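If my reading is right, the projection works like this sketch (names such as `mA`, `cta_tiler`, and the three `Step<>` patterns follow examples/cute/tutorial/sgemm_1.cu; the shapes in the comments are my assumption):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

// How I understand the Step<> projection when tiling the problem for one CTA.
// cta_tiler is (BLK_M, BLK_N, BLK_K); the CTA coordinate leaves the K mode open.
template <class TA, class TB, class TC, class Tiler>
__device__ void tile_for_cta(TA mA, TB mB, TC mC, Tiler cta_tiler)
{
  auto cta_coord = make_coord(blockIdx.x, blockIdx.y, _);  // (m, n, k)

  // _1 = apply that mode of the tiler, X = ignore (project out) that mode.
  Tensor gA = local_tile(mA, cta_tiler, cta_coord, Step<_1,  X, _1>{});  // (BLK_M, BLK_K, k)
  Tensor gB = local_tile(mB, cta_tiler, cta_coord, Step< X, _1, _1>{});  // (BLK_N, BLK_K, k)
  Tensor gC = local_tile(mC, cta_tiler, cta_coord, Step<_1, _1,  X>{});  // (BLK_M, BLK_N)
}
```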
Also, how exactly does `local_partition` divide the data? Do we get bank conflicts (yes, we will), and how can we avoid them?
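For context, this is the partitioning pattern I am looking at, written as a kernel-body sketch in the style of sgemm_1.cu (the 32x8 thread layout and the tile names are my assumption):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

// gA is the (BLK_M, BLK_K, k) global tile, sA the (BLK_M, BLK_K) shared-memory tile.
template <class GTensor, class STensor>
__device__ void copy_tile_per_thread(GTensor gA, STensor sA)
{
  // 256 threads viewed as a 32x8 (column-major) grid laid over the tile.
  auto tA = make_layout(make_shape(Int<32>{}, Int<8>{}));

  // Each thread owns the elements at its own (threadIdx.x % 32, threadIdx.x / 32)
  // position, repeated every 32 rows and every 8 columns of the tile,
  // so its slice has shape (BLK_M/32, BLK_K/8).
  Tensor tAgA = local_partition(gA, tA, threadIdx.x);  // (THR_M, THR_K, k)
  Tensor tAsA = local_partition(sA, tA, threadIdx.x);  // (THR_M, THR_K)

  // Copy the first k-tile of A from global to shared memory.
  copy(tAgA(_, _, 0), tAsA);
}
```

From what I have read, whether the shared-memory accesses conflict depends on the layout of `sA`, and padding or swizzling that layout is the usual remedy, but I am not sure what the idiomatic CuTe way to do it is.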
Emmm, thank you, I have read all three blogs you mentioned, but they are discussing CUDA cores..... I am learning Tensor Cores, so I am reading CUTLASS.
> Off topic: Just came across this issue (as a github-mancer). Based on your recent questions I assume you want to write gemm from ground up. And not to be...
I mean, using CuTe.
Thank you very much for your reply!!!! I noticed you are using `auto thr_mma = tiled_mma.get_slice(thread_idx);`. So what is the difference between that and `auto tAgA = local_partition(gA, tA, threadIdx.x); // (THR_M,THR_K,k)`...
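To make the comparison concrete, here is a kernel-body sketch of the two styles side by side (the `UniversalFMA` atom and the 16x16x1 thread layout are only my assumption, following the tutorial sgemm examples):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

// sA is (BLK_M, BLK_K) and sB is (BLK_N, BLK_K) in shared memory; gC is (BLK_M, BLK_N).
template <class SA, class SB, class GC>
__device__ void compare_partitionings(SA sA, SB sB, GC gC)
{
  // Style 1: local_partition with an explicit thread layout.
  // I have to pick tC and the Step<> projections myself and keep A/B/C consistent.
  auto tC = make_layout(make_shape(Int<16>{}, Int<16>{}));
  Tensor tCsA = local_partition(sA, tC, threadIdx.x, Step<_1,  X>{});  // (THR_M, BLK_K)
  Tensor tCsB = local_partition(sB, tC, threadIdx.x, Step< X, _1>{});  // (THR_N, BLK_K)
  Tensor tCgC = local_partition(gC, tC, threadIdx.x, Step<_1, _1>{});  // (THR_M, THR_N)

  // Style 2: TiledMMA::get_slice. The MMA atom plus its thread layout decide the
  // partitioning, so partition_A/B/C automatically match how the MMA consumes data.
  TiledMMA mma = make_tiled_mma(UniversalFMA<float, float, float>{},
                                Layout<Shape<_16, _16, _1>>{});
  auto thr_mma = mma.get_slice(threadIdx.x);
  Tensor tCsA2 = thr_mma.partition_A(sA);  // (MMA, MMA_M, MMA_K)
  Tensor tCsB2 = thr_mma.partition_B(sB);  // (MMA, MMA_N, MMA_K)
  Tensor tCgC2 = thr_mma.partition_C(gC);  // (MMA, MMA_M, MMA_N)
}
```

My naive understanding is that `local_partition` just slices by a thread layout I choose, while `get_slice` derives each thread's view from the MMA atom itself, which should matter once a real tensor-core MMA atom replaces the FMA. Is that right?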