quda
Fix performance of transform_reduce
Reminder to myself: since transform_reduce was made generic, some kernels based on it are seeing significant regressions. This is because it now maps to the multi-reduction kernel, which limits the number of threads per block that are instantiated to 256. Previously, it was instantiated for 512 threads per block.
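To make the cap concrete, here is a minimal host-side sketch of how a compile-time limit on instantiated block sizes can silently clamp a launch that asks for 512 threads. This is not QUDA's actual code: `instantiated_block_size`, the `MaxBlock` parameter, the power-of-two stepping, and the 32-thread minimum are all assumptions made for illustration. It only models the clamping, not the performance effect.

```cpp
// Hypothetical model of a reduction kernel that is only instantiated for
// power-of-two block sizes up to MaxBlock (an assumption, not QUDA's API).
template <unsigned MaxBlock>
unsigned instantiated_block_size(unsigned requested)
{
  unsigned block = 32; // assumed warp-sized minimum
  // Step up through powers of two while staying within both the request
  // and the largest instantiated size.
  while (block * 2 <= requested && block * 2 <= MaxBlock) block *= 2;
  return block;
}
```

In this model, a request for 512 threads resolves to 256 when `MaxBlock` is 256 and to 512 when `MaxBlock` is 512, which is the difference between the current and previous behaviour described above.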