Haicheng Wu
Noted. I know CUTLASS is hot; it is used by many researchers and productized by quite a few companies. We get requests to add new functionality all the time. One...
We are working on this. @loo-loo
Group conv is supported in 2.10. We will keep improving it.
Your `RSC` is 64, which means your kernel is completely memory bound and only needs 2 iterations of the mainloop. You can use the CUTLASS profiler (https://github.com/NVIDIA/cutlass/blob/master/media/docs/profiler.md) to profile all available...
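For context on the iteration count: implicit GEMM maps a conv2d fprop onto a GEMM with GEMM_M = N\*P\*Q, GEMM_N = K, and GEMM_K = R\*S\*C, and the mainloop advances one threadblock K-tile per iteration. A minimal sketch of that arithmetic, where the 1x1x64 filter and the 32-wide K tile are illustrative assumptions, not your exact kernel:

```cuda
#include <cstdio>

// Implicit-GEMM fprop dimension mapping: GEMM_K = R*S*C. The mainloop
// consumes one threadblock K-tile per iteration, so RSC = 64 with an
// assumed K tile of 32 yields only two iterations: the kernel spends
// almost all of its time moving data, i.e. it is memory bound.
int main() {
    int R = 1, S = 1, C = 64;  // illustrative filter giving R*S*C = 64
    int tile_k = 32;           // assumed threadblock K extent
    int gemm_k = R * S * C;
    int iters = (gemm_k + tile_k - 1) / tile_k;
    printf("GEMM_K = %d, mainloop iterations = %d\n", gemm_k, iters);
    return 0;
}
```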
Maybe you can also try a 128x64 threadblock size and a 64x32 warp size. As I said earlier, this problem size is completely memory bound; your performance data looks reasonable to me.
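In CUTLASS 2.x these tile sizes are spelled as `cutlass::gemm::GemmShape<M, N, K>` template parameters. A minimal sketch of the shapes suggested above; the K extent of 32 and the kernel configuration they would plug into are assumptions:

```cuda
#include "cutlass/gemm/gemm.h"

// Tile shapes matching the suggestion: 128x64 threadblock, 64x32 warp.
// These aliases would be passed as the ThreadblockShape / WarpShape
// template arguments of a kernel; the K extent of 32 is assumed.
using ThreadblockShape = cutlass::gemm::GemmShape<128, 64, 32>;
using WarpShape        = cutlass::gemm::GemmShape<64, 32, 32>;
```

This arrangement gives (128/64) x (64/32) = 4 warps per threadblock tiling the M x N plane.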
Your CUDA is too old; I recommend using 11.6+. See https://github.com/NVIDIA/cutlass/discussions/495. Again, your problem size is completely memory bound, so your perf is not bad.
See https://github.com/NVIDIA/cutlass/blob/master/media/images/cutlass-2.9-implicit-gemm-performance.png
> do you have the performance comparison data between the high version of cuda and cuda10.2?

Roughly 10%, but again, yours is memory bound, so better codegen cannot help much.

> ...
CUDA is getting better.
> we want to reduce between different warps?

Correct.

> So we can not use warp-sync functions like `__shfl_down_sync`? We will use atomicAdd to global memory?

We reduce between warps....
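For context: `__shfl_down_sync` only exchanges values within a single warp, so a cross-warp reduction typically stages per-warp partial sums in shared memory, then combines them, with one atomicAdd per block into global memory. A generic sketch of that pattern (not CUTLASS's internal reduction; all names are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal cross-warp sum reduction. __shfl_down_sync only moves data
// within one warp, so per-warp partials go through shared memory; one
// atomicAdd per block accumulates into the zero-initialized result.
__global__ void block_sum(const float* in, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (idx < n) ? in[idx] : 0.0f;

    // 1) Intra-warp reduction with warp shuffles.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);

    // 2) Lane 0 of each warp publishes its partial sum to shared memory.
    __shared__ float partial[32];  // enough for 1024 threads / 32
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;
    if (lane == 0) partial[warp] = v;
    __syncthreads();

    // 3) The first warp reduces the per-warp partials, then one thread
    //    combines across blocks with a single atomic.
    if (warp == 0) {
        int num_warps = (blockDim.x + 31) / 32;
        v = (lane < num_warps) ? partial[lane] : 0.0f;
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffff, v, offset);
        if (lane == 0) atomicAdd(out, v);
    }
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, sizeof(float));
    cudaMemset(out, 0, sizeof(float));  // atomicAdd needs a zeroed result
    float* host = new float[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;  // expected sum is n
    cudaMemcpy(in, host, n * sizeof(float), cudaMemcpyHostToDevice);
    block_sum<<<(n + 255) / 256, 256>>>(in, out, n);
    float result = 0.0f;
    cudaMemcpy(&result, out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f (expected %d)\n", result, n);
    delete[] host;
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```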