gtensor

add multi-d reductions

Open bd4 opened this issue 3 years ago • 0 comments

We already support one multi-d reduction, sum_axis_to, which is implemented in a way that requires lots of segments to perform well: there is one thread per output array element, and each thread does its sum linearly. This is a special case of segmented reduction in which all segments have the same length.

Clever use of thrust::reduce_by_key could implement this, but it may not perform particularly well, at least based on this thorough (but very old, 2013) analysis: https://moderngpu.github.io/segreduce.html

Another option is to use the CUB/rocPRIM/hipCUB segmented reductions:
https://nvlabs.github.io/cub/structcub_1_1_device_segmented_reduce.html
https://codedocs.xyz/ROCmSoftwarePlatform/rocPRIM/group__devicemodule.html#ga02e109ce385d48b1fbde10d42900532d

For SYCL, we can use the standard reduction API, but it will likely take a bit more work to optimize well across different output/reduction length profiles (assuming CUB/rocPRIM even do such optimizations).
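To make the equal-length segmented reduction concrete, here is a minimal CPU-only sketch in plain C++. It is not gtensor code: the function names segmented_sum_linear and segmented_sum_by_key are hypothetical, and it assumes each segment is stored contiguously in a flat buffer, which may not match the layout sum_axis_to actually sees. The first function mirrors the "one thread per output element, linear sum" pattern; the second expresses the same result with implicit keys (key = index / segment length), which is roughly the shape of input a reduce-by-key style algorithm would consume.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference, not gtensor API. Sums contiguous segments of equal
// length seg_len. Each outer-loop iteration is the work a single GPU thread
// would do in the "one thread per output element" scheme.
std::vector<double> segmented_sum_linear(const std::vector<double>& in,
                                         std::size_t seg_len)
{
  assert(seg_len > 0 && in.size() % seg_len == 0);
  const std::size_t n_segments = in.size() / seg_len;
  std::vector<double> out(n_segments, 0.0);
  for (std::size_t s = 0; s < n_segments; ++s) { // one "thread" per output
    double acc = 0.0;
    for (std::size_t i = 0; i < seg_len; ++i) {  // serial sum over the segment
      acc += in[s * seg_len + i];
    }
    out[s] = acc;
  }
  return out;
}

// Same result, phrased as a reduce-by-key: the key of element i is
// i / seg_len, and runs of consecutive equal keys are reduced together.
std::vector<double> segmented_sum_by_key(const std::vector<double>& in,
                                         std::size_t seg_len)
{
  assert(seg_len > 0 && in.size() % seg_len == 0);
  std::vector<double> out;
  std::size_t current_key = 0;
  double acc = 0.0;
  for (std::size_t i = 0; i < in.size(); ++i) {
    const std::size_t key = i / seg_len; // implicit key, never stored
    if (key != current_key) {            // key changed: close previous run
      out.push_back(acc);
      acc = 0.0;
      current_key = key;
    }
    acc += in[i];
  }
  if (!in.empty()) {
    out.push_back(acc);                  // close the final run
  }
  return out;
}
```

With few, long segments the first form leaves most of a GPU idle, since parallelism is limited to the number of outputs. That is the case where a library segmented reduction that also parallelizes within each segment would be expected to help.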

bd4 · Mar 29 '22 11:03