ziyuhuang123 issues

Results: 61 issues of ziyuhuang123

[QST]Are Tensors Equivalent After Different Layout Transformations?

**What is your question?** I first have an array of pointers A. Then, following layout B, I obtain a tensor T0. Similarly, I have another array of pointers B (different...

question

? - Needs Triage

inactive-30d

[QST]When will DelayTmaStore be important?

I see DelayTmaStore in the code, but I do not understand when we need it. Could anyone tell me? Thanks!

question

? - Needs Triage

[QST]Inquiry About the Computation Size in a Single cute::gemm Call in CUTLASS

**What is your question?** Could you please explain how large a single cute::gemm computation is in CUTLASS? Since multiple threads compute together, and it doesn’t explicitly state the number of...

question

? - Needs Triage

[QST]Question About the Use of MMA In-Flight in SS_WarpSpecialized

I don’t understand why MMA in-flight is used in SS_WarpSpecialized. As shown in the diagram below, I’ve illustrated my understanding with a pipeline diagram. If there is MMA in-flight, then...

question

? - Needs Triage

[QST]Why Does CUTLASS Handle the First K Dimension Separately in Matrix Multiplication?

Could you please explain why CUTLASS separates the first iteration of the K dimension in matrix multiplication? Does this really improve performance?

question

? - Needs Triage

[QST]Why Does CUTLASS Use 3-4-3 Swizzle?

May I kindly ask why the swizzle configuration in CUTLASS is specifically set to 3, 4, and 3? I would greatly appreciate any insights or explanations regarding the rationale behind...

question

? - Needs Triage

inactive-30d

[QST]How Does TMA Work in CUTLASS for Writing from Shared Memory to Global Memory?

Could you explain how TMA works in CUTLASS? For example, when writing from the shared memory Tensor sS to the global memory Tensor gD, it seems that the data is...

question

? - Needs Triage

inactive-30d

[QST]Behavior of TMA Store and Wait Mechanism in CUTLASS

In CUTLASS, there is a tma_store_wait function, which corresponds to cp.async.bulk.wait_group.read. Based on my observations while working with TMA, it seems that after completing a TMA-store operation, waiting is not...

question

? - Needs Triage

inactive-30d

[QST]Is the Key Difference Between mbarrier and barrier Their Handling of Producer-Consumer Count?

Barrier is used for scenarios where the number of producers and consumers is the same, while mbarrier is used when the numbers differ. Is this conclusion correct? Is this the...

question

? - Needs Triage

inactive-30d

[QST]How to Handle Synchronization with Different Thread Counts for Producer and Consumer in CUTLASS?

In scenarios where both producer and consumer threads exist, how can we achieve synchronization using CUTLASS's barrier.sync/arrive? I understand that in barrier.arrive(a, b), a represents the number of threads required...

question

? - Needs Triage

inactive-30d