Junkai-Wu comments

Results: 36 comments of Junkai-Wu

About the make_tensor function

I don't quite understand what your issue is from the description above. From my observation, there is only one type of `make_tensor` call here: `Tensor mA = make_tensor(make_gmem_ptr(A), select(shape_MNK),`...

[BUG] ElementC=void kernel reads non-void in `GemmDescription`

@manishucsd I ran the latest cutlass code and the `description_.C.element` of each of the above kernels is `void`. Could you verify again?

[QST]when will DelayTmaStore be important?

It optimizes the dependency between storing to shared memory (`ss`) and the TMA store to global memory (`sg`) in the epilogue. Usually one `sg` is issued after one `ss`. If DelayTmaStore...

[QST]Is the Key Difference Between mbarrier and barrier Their Handling of Producer-Consumer Count?

This conclusion is incorrect. Both barrier and mbarrier are based on the arrive-wait mechanism. This mechanism doesn't necessarily require the number of producers and consumers to be the same. mbarrier is more...

[QST]How to Handle Synchronization with Different Thread Counts for Producer and Consumer in CUTLASS?

The example you showed is a Hopper warp-specialized kernel where the MMA warp groups execute MMA operations plus epilogue operations. Therefore, when executing the next MMA operation, it has to make...

Add `infinity` to `cutlass::platform::numeric_limits<half_t>`

This change has been added in `cutlass/half.h`. This PR can be closed. @hwu36

[QST]From index into a coordinate (or coordinate into an index), it has two different implementations, how should one distinguish and understand the scenarios for their use?

To compute an index from a coordinate, you need a shape and a stride. If the stride is explicitly provided, the function calculates the index using the given stride....
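The coordinate-to-index mapping described above can be sketched in plain C++ without any CuTe headers. The helper names below (`compact_col_major`, `coord_to_index`, `coord_to_index_default`) are hypothetical and are not CuTe functions. The sketch also assumes that, when no stride is given, one is derived from the shape in compact column-major order; the excerpt above is truncated before it covers that case, so treat this as an illustration only.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper (not a CuTe API): derive a compact column-major
// stride from a shape, i.e. stride[0] = 1, stride[i] = stride[i-1] * shape[i-1].
template <std::size_t N>
std::array<int, N> compact_col_major(const std::array<int, N>& shape) {
    std::array<int, N> stride{};
    int running = 1;
    for (std::size_t i = 0; i < N; ++i) {
        stride[i] = running;
        running *= shape[i];
    }
    return stride;
}

// Hypothetical helper: with an explicit stride, the index is the inner
// product of the coordinate and the stride.
template <std::size_t N>
int coord_to_index(const std::array<int, N>& coord,
                   const std::array<int, N>& stride) {
    int index = 0;
    for (std::size_t i = 0; i < N; ++i) {
        index += coord[i] * stride[i];
    }
    return index;
}

// Hypothetical helper: without an explicit stride, fall back to the
// compact column-major stride derived from the shape (an assumption here).
template <std::size_t N>
int coord_to_index_default(const std::array<int, N>& coord,
                           const std::array<int, N>& shape) {
    return coord_to_index(coord, compact_col_major(shape));
}
```

For a shape of (4, 3) and a coordinate of (1, 2), an explicit row-major stride of (3, 1) gives 1*3 + 2*1 = 5, while the shape-derived column-major stride of (1, 4) gives 1*1 + 2*4 = 9. The same coordinate maps to different indices depending on which stride is used, which is why the two call forms are worth distinguishing.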

[BUG] calling cast_smem_ptr_to_uint(device fn) from make_gmma_desc(host device fn) is not allowed
