Junkai-Wu

36 comments by Junkai-Wu

I don't quite understand what your issue is from the description above. From what I can see, there is only one type of `make_tensor` call here: ``` Tensor mA = make_tensor(make_gmem_ptr(A), select(shape_MNK),...

@manishucsd I ran the latest cutlass code, and `description_.C.element` is `void` for all of the above kernels. Could you verify again?

It's to optimize the dependency between storing to shared memory (`ss`) and the TMA store to global memory (`sg`) in the epilogue. Usually one `sg` is issued after one `ss`. If DelayTmaStore...

This conclusion is incorrect. Both barrier and mbarrier are based on an arrive-wait mechanism. This mechanism doesn't necessarily require the number of producers and consumers to be the same. mbarrier is more...

The example you showed is a Hopper warp-specialized kernel where the mma warp groups execute both mma operations and epilogue operations. Therefore, when executing the next mma operation, it has to make...

This change has been added in `cutlass/half.h`. This PR can be closed @hwu36

To convert a coord to an index, you need a shape and a stride. If the stride is explicitly provided, the function calculates the index using the given stride....
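As a rough illustration of the coord-to-index mapping described above, here is a minimal plain C++ sketch. It is not the CuTe implementation: `coord_to_index`, `compact_col_major`, and `usage_example` are hypothetical names, and the fallback to compact column-major strides when no stride is given is an assumption, since the comment is cut off before it says what happens in that case.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not the CuTe API): the index is the inner product
// of the coordinate and the stride, index = sum_i coord[i] * stride[i].
template <std::size_t N>
std::size_t coord_to_index(std::array<std::size_t, N> const& coord,
                           std::array<std::size_t, N> const& stride) {
  std::size_t index = 0;
  for (std::size_t i = 0; i < N; ++i) {
    index += coord[i] * stride[i];
  }
  return index;
}

// Assumption: when no stride is supplied, derive compact column-major
// strides from the shape (the leftmost mode is contiguous).
template <std::size_t N>
std::array<std::size_t, N> compact_col_major(std::array<std::size_t, N> const& shape) {
  std::array<std::size_t, N> stride{};
  std::size_t running = 1;
  for (std::size_t i = 0; i < N; ++i) {
    stride[i] = running;   // elements skipped per step along mode i
    running *= shape[i];
  }
  return stride;
}

// Usage: a 4x3 shape and the coordinate (1, 2).
inline void usage_example() {
  std::array<std::size_t, 2> shape{4, 3};
  std::array<std::size_t, 2> coord{1, 2};
  // Derived column-major strides are (1, 4): 1*1 + 2*4 = 9.
  assert(coord_to_index(coord, compact_col_major(shape)) == 9);
  // Explicit row-major strides (3, 1): 1*3 + 2*1 = 5.
  assert(coord_to_index(coord, std::array<std::size_t, 2>{3, 1}) == 5);
}
```

The same coordinate maps to different indices depending on the stride, which is why the shape alone is not enough unless a default stride convention is assumed.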

We should add a `#if defined(__CUDA_ARCH__)` guard around the `cast_smem_ptr_to_uint` function to prevent it from being called from the host. I'll file a PR for this.

Got it. I'll make all functions that call `cast_smem_ptr_to_uint` `HOST_DEVICE`.