quda Fix performance of transform

Fix performance of transform_reduce

Open maddyscientist opened this issue 4 years ago • 0 comments

Reminder to myself. Since transform_reduce was made generic, some kernels based on it are seeing significant performance regressions. This is because it now maps to the multi-reduction kernel, which limits the number of threads per block it is instantiated for to 256. Previously, it was instantiated for 512 threads per block.
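To make the effect of the cap concrete, here is a minimal host-side C++ sketch. It is not QUDA code: the names `kMultiReduceMaxBlock`, `kSingleReduceMaxBlock`, `clamp_block`, `blocks_needed`, and `transform_reduce_blocked` are hypothetical, and only the values 256 and 512 come from the description above. It assumes one element per thread and models each thread block as a slice of the input.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical caps mirroring the numbers in this issue; not QUDA identifiers.
constexpr unsigned kMultiReduceMaxBlock = 256;   // cap on the generic (multi-reduction) path
constexpr unsigned kSingleReduceMaxBlock = 512;  // cap on the previous path

// Clamp a requested block size to the largest size the kernel is instantiated for.
constexpr unsigned clamp_block(unsigned requested, unsigned max_block) {
  return requested < max_block ? requested : max_block;
}

// Blocks needed to cover n elements, assuming one element per thread.
constexpr std::size_t blocks_needed(std::size_t n, unsigned block) {
  return (n + block - 1) / block;
}

// Host-side model of a block-wise transform_reduce (sum only): each "block"
// reduces its own slice to a partial sum, then the partial sums are combined.
template <typename T, typename Transform>
T transform_reduce_blocked(const std::vector<T>& v, unsigned block, Transform f) {
  assert(block > 0);
  T total{};
  for (std::size_t start = 0; start < v.size(); start += block) {
    const std::size_t end = std::min<std::size_t>(v.size(), start + block);
    T partial{};
    for (std::size_t i = start; i < end; ++i) partial += f(v[i]);
    total += partial;
  }
  return total;
}
```

In this model, a launch that asks for 512 threads per block is clamped to 256 on the generic path, so the same input needs twice as many blocks. The reduced value itself is unchanged, which is consistent with the regression being one of performance rather than correctness.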

Mar 04 '21 17:03 maddyscientist
