Port `reverse` from CUDA.jl
This may have to wait for KA 0.10, depending on how much `cpu=true` affects performance.
It seems that, at least with CUDA.jl, using dynamic workgroup sizes recovers ~50% of the performance lost when switching over to KernelAbstractions. Is there some overhead in KA that is smaller with dynamic workgroup sizes?
cc @vchuravy
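For concreteness, here is a minimal sketch of what the dynamic vs. static workgroup-size launch looks like with KernelAbstractions (the kernel and wrapper names are made up for this example, not the CUDA.jl implementation):

```julia
using KernelAbstractions

# Hypothetical out-of-place reverse kernel (not the CUDA.jl one):
# element i of the output gets element n - i + 1 of the input.
@kernel function _reverse_kernel!(out, @Const(inp))
    i = @index(Global, Linear)
    n = length(inp)
    @inbounds out[i] = inp[n - i + 1]
end

function ka_reverse!(out, inp; workgroupsize = nothing)
    backend = KernelAbstractions.get_backend(out)
    kernel! = if workgroupsize === nothing
        _reverse_kernel!(backend)                 # dynamic: size picked at launch
    else
        _reverse_kernel!(backend, workgroupsize)  # static: size baked into the instance
    end
    kernel!(out, inp; ndrange = length(inp))
    return out
end

# ka_reverse!(out, inp)                       # dynamic workgroup size
# ka_reverse!(out, inp; workgroupsize = 256)  # static workgroup size
```

Benchmarking both variants (with `KernelAbstractions.synchronize(backend)` before timing) should show whether the gap really comes from dynamic vs. static instantiation.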
Huh, I would expect static kernel sizes to be a performance benefit or at least performance neutral.
The only thing that could be happening is that with a static size we are suddenly able to unroll more, or something like that.