ziyuhuang123

61 issues by ziyuhuang123

**What is your question?** I first have an array of pointers A. Then, following layout B, I obtain a tensor T0. Similarly, I have another array of pointers B (different...

question
? - Needs Triage
inactive-30d

I see DelayTmaStore in the code, but I do not understand when it is needed. Could anyone explain? Thanks!

question
? - Needs Triage

**What is your question?** Could you please explain how large a single cute::gemm computation is in CUTLASS? Since multiple threads compute together, and it doesn’t explicitly state the number of...

question
? - Needs Triage

I don’t understand why MMA in-flight is used in SS_WarpSpecialized. I’ve illustrated my understanding in the pipeline diagram below. If there is MMA in-flight, then...

question
? - Needs Triage

Could you please explain why CUTLASS separates the first iteration of the K dimension in matrix multiplication? Does this really improve performance?

question
? - Needs Triage

May I kindly ask why the swizzle configuration in CUTLASS is specifically set to 3, 4, and 3? I would greatly appreciate any insights or explanations regarding the rationale behind...

question
? - Needs Triage
inactive-30d

Could you explain how TMA works in CUTLASS? For example, when writing from the shared memory Tensor sS to the global memory Tensor gD, it seems that the data is...

question
? - Needs Triage
inactive-30d

In CUTLASS, there is a tma_store_wait function, which corresponds to cp.async.bulk.wait_group.read. Based on my observations while working with TMA, it seems that after completing a TMA-store operation, waiting is not...

question
? - Needs Triage
inactive-30d

Barrier is used for scenarios where the number of producers and consumers is the same, while mbarrier is used when the numbers differ. Is this conclusion correct? Is this the...

question
? - Needs Triage
inactive-30d

In scenarios where both producer and consumer threads exist, how can we achieve synchronization using CUTLASS's barrier.sync/arrive? I understand that in barrier.arrive(a, b), a represents the number of threads required...

question
? - Needs Triage
inactive-30d