Port `reverse` from CUDA.jl
This may have to wait for KA 0.10, depending on how much `cpu=true` affects performance.
It seems that, at least with CUDA.jl, using dynamic workgroup sizes recovers ~50% of the performance lost when switching over to KernelAbstractions. Is there some overhead in KA that is smaller with dynamic workgroup sizes?
cc @vchuravy
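For concreteness, here is a minimal sketch of what the dynamic vs. static workgroup-size launch looks like with KernelAbstractions (the kernel and wrapper names are made up for this example, not the CUDA.jl implementation):

```julia
using KernelAbstractions

# Hypothetical out-of-place reverse kernel (not the CUDA.jl one):
# element i of the output gets element n - i + 1 of the input.
@kernel function _reverse_kernel!(out, @Const(inp))
    i = @index(Global, Linear)
    n = length(inp)
    @inbounds out[i] = inp[n - i + 1]
end

function ka_reverse!(out, inp; workgroupsize = nothing)
    backend = KernelAbstractions.get_backend(out)
    kernel! = if workgroupsize === nothing
        _reverse_kernel!(backend)                 # dynamic: size picked at launch
    else
        _reverse_kernel!(backend, workgroupsize)  # static: size baked into the instance
    end
    kernel!(out, inp; ndrange = length(inp))
    return out
end

# ka_reverse!(out, inp)                       # dynamic workgroup size
# ka_reverse!(out, inp; workgroupsize = 256)  # static workgroup size
```

Benchmarking both variants (with `KernelAbstractions.synchronize(backend)` before timing) should show whether the gap really comes from dynamic vs. static instantiation.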
Huh, I would expect static kernel sizes to be a performance benefit or at least performance neutral.
The only thing that could be happening is that with a static size we are suddenly able to unroll more, or something like that.