ziyuhuang123
**What is your question?** I first have an array of pointers A. Then, following layout B, I obtain a tensor T0. Similarly, I have another array of pointers B (different...
I see DelayTmaStore in the code but I do not understand when we need it. Could anyone tell me? Thanks!
**What is your question?** Could you please explain how large a single cute::gemm computation is in CUTLASS? Since multiple threads compute together, and it doesn’t explicitly state the number of...
I don’t understand why MMA in-flight is used in SS_WarpSpecialized. As shown in the diagram below, I’ve illustrated my understanding with a pipeline diagram. If there is MMA in-flight, then...
Could you please explain why CUTLASS separates the first iteration of the K dimension in matrix multiplication? Does this really improve performance?
May I kindly ask why the swizzle configuration in CUTLASS is specifically set to 3, 4, and 3? I would greatly appreciate any insights or explanations regarding the rationale behind...
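For readers unfamiliar with what the three numbers control, here is a minimal standalone sketch of the index transformation. It assumes the semantics of CuTe's `Swizzle<B, M, S>`: keep the lowest `M` bits untouched, and XOR a `B`-bit field located `S` bits higher into the `B` bits directly above those `M` bits. The helper name `swizzle_offset` is hypothetical and not part of CUTLASS; whether the offset counts elements or bytes depends on how the swizzle is composed with a layout.

```cpp
// Hypothetical helper mirroring the assumed behaviour of cute::Swizzle<B, M, S>
// for a non-negative shift S. Not CUTLASS code.
template <int B, int M, int S>
constexpr int swizzle_offset(int offset) {
    static_assert(B >= 0 && M >= 0 && S >= 0, "sketch only covers non-negative parameters");
    // Mask selecting the B source bits, which sit S positions above the target bits.
    constexpr int src_mask = ((1 << B) - 1) << (M + S);
    // Shift the source bits down onto the target bits [M, M+B) and XOR them in.
    return offset ^ ((offset & src_mask) >> S);
}

// With <3, 4, 3>: bits 7..9 are XORed into bits 4..6; bits 0..3 are unchanged.
static_assert(swizzle_offset<3, 4, 3>(0x010) == 0x010, "no source bits set, offset unchanged");
static_assert(swizzle_offset<3, 4, 3>(0x080) == 0x090, "bit 7 flips bit 4");
static_assert(swizzle_offset<3, 4, 3>(0x380) == 0x3F0, "bits 7..9 flip bits 4..6");
// Applying the swizzle twice returns the original offset.
static_assert(swizzle_offset<3, 4, 3>(swizzle_offset<3, 4, 3>(0x2A5)) == 0x2A5, "involution");
```

Under these assumptions, the low 4 bits keep a contiguous chunk of 16 units together, and the 3 XORed bits permute 8 such chunks differently in each of 8 consecutive rows, which is the usual way to spread column accesses across shared-memory banks.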
Could you explain how TMA works in CUTLASS? For example, when writing from the shared memory Tensor sS to the global memory Tensor gD, it seems that the data is...
In CUTLASS, there is a tma_store_wait function, which corresponds to cp.async.bulk.wait_group.read. Based on my observations while working with TMA, it seems that after completing a TMA-store operation, waiting is not...
[QST] Is the Key Difference Between mbarrier and barrier Their Handling of Producer-Consumer Count?
Barrier is used for scenarios where the number of producers and consumers is the same, while mbarrier is used when the numbers differ. Is this conclusion correct? Is this the...
In scenarios where both producer and consumer threads exist, how can we achieve synchronization using CUTLASS's barrier.sync/arrive? I understand that in barrier.arrive(a, b), a represents the number of threads required...