Haicheng Wu
Noted. I know CUTLASS is hot; it is used by many researchers and productized by quite a few companies. We get requests to add new functionality all the time. One...
We are working on this. @loo-loo
Group conv is supported in 2.10. We will keep improving it.
Your `RSC` is 64, which means your kernel is completely memory bound and only needs 2 iterations of the mainloop. You can use the CUTLASS profiler (https://github.com/NVIDIA/cutlass/blob/master/media/docs/profiler.md) to profile all available...
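For context on the iteration count: implicit GEMM maps a conv2d fprop onto a GEMM with GEMM_M = N\*P\*Q, GEMM_N = K, and GEMM_K = R\*S\*C, and the mainloop advances one threadblock K-tile per iteration. A minimal sketch of that arithmetic, where the 1x1x64 filter and the 32-wide K tile are illustrative assumptions, not your exact kernel:

```cuda
#include <cstdio>

// Implicit-GEMM fprop dimension mapping: GEMM_K = R*S*C. The mainloop
// consumes one threadblock K-tile per iteration, so RSC = 64 with an
// assumed K tile of 32 yields only two iterations: the kernel spends
// almost all of its time moving data, i.e. it is memory bound.
int main() {
    int R = 1, S = 1, C = 64;  // illustrative filter giving R*S*C = 64
    int tile_k = 32;           // assumed threadblock K extent
    int gemm_k = R * S * C;
    int iters = (gemm_k + tile_k - 1) / tile_k;
    printf("GEMM_K = %d, mainloop iterations = %d\n", gemm_k, iters);
    return 0;
}
```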
Maybe you can also try a 128x64 threadblock size and a 64x32 warp size. As I said earlier, this problem size is completely memory bound; your performance data looks reasonable to me.
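In CUTLASS 2.x these tile sizes are spelled as `cutlass::gemm::GemmShape<M, N, K>` template parameters. A minimal sketch of the shapes suggested above; the K extent of 32 and the kernel configuration they would plug into are assumptions:

```cuda
#include "cutlass/gemm/gemm.h"

// Tile shapes matching the suggestion: 128x64 threadblock, 64x32 warp.
// These aliases would be passed as the ThreadblockShape / WarpShape
// template arguments of a kernel; the K extent of 32 is assumed.
using ThreadblockShape = cutlass::gemm::GemmShape<128, 64, 32>;
using WarpShape        = cutlass::gemm::GemmShape<64, 32, 32>;
```

This arrangement gives (128/64) x (64/32) = 4 warps per threadblock tiling the M x N plane.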
Your CUDA is too old; I recommend using 11.6+. See https://github.com/NVIDIA/cutlass/discussions/495. Again, your problem size is completely memory bound, so your perf is not bad.
See https://github.com/NVIDIA/cutlass/blob/master/media/images/cutlass-2.9-implicit-gemm-performance.png
> do you have the performance comparison data between the high version of cuda and cuda10.2?

Roughly 10%, but again, yours is memory bound, so better codegen cannot help much.

> ...
CUDA is getting better.
> we want to reduce between different warps?

Correct.

> So we can not use warp-sync functions like `__shfl_down_sync`? We will use atomicAdd to global memory?

We reduce between warps....
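For context: `__shfl_down_sync` only exchanges values within a single warp, so a cross-warp reduction typically stages per-warp partial sums in shared memory, then combines them, with one atomicAdd per block into global memory. A generic sketch of that pattern (not CUTLASS's internal reduction; all names are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal cross-warp sum reduction. __shfl_down_sync only moves data
// within one warp, so per-warp partials go through shared memory; one
// atomicAdd per block accumulates into the zero-initialized result.
__global__ void block_sum(const float* in, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (idx < n) ? in[idx] : 0.0f;

    // 1) Intra-warp reduction with warp shuffles.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);

    // 2) Lane 0 of each warp publishes its partial sum to shared memory.
    __shared__ float partial[32];  // enough for 1024 threads / 32
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;
    if (lane == 0) partial[warp] = v;
    __syncthreads();

    // 3) The first warp reduces the per-warp partials, then one thread
    //    combines across blocks with a single atomic.
    if (warp == 0) {
        int num_warps = (blockDim.x + 31) / 32;
        v = (lane < num_warps) ? partial[lane] : 0.0f;
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffff, v, offset);
        if (lane == 0) atomicAdd(out, v);
    }
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, sizeof(float));
    cudaMemset(out, 0, sizeof(float));  // atomicAdd needs a zeroed result
    float* host = new float[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;  // expected sum is n
    cudaMemcpy(in, host, n * sizeof(float), cudaMemcpyHostToDevice);
    block_sum<<<(n + 255) / 256, 256>>>(in, out, n);
    float result = 0.0f;
    cudaMemcpy(&result, out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f (expected %d)\n", result, n);
    delete[] host;
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```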