Add multi-d reductions
We already support one multi-d reduction, sum_axis_to. Its implementation uses one thread per output array element and sums each segment linearly, so it needs lots of segments to perform well. It is a special case of segmented reduction in which all segments have the same length. Clever use of thrust::reduce_by_key could implement this, but it may not perform particularly well, at least based on this thorough (but very old, from 2013) analysis:
https://moderngpu.github.io/segreduce.html
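To make the reduce-by-key idea concrete, here is a rough host-side sketch in plain C++ (not GPU code, and not how sum_axis_to is written today). It assumes the reduced axis is contiguous in memory, so element i belongs to segment i / segment_length. The helper names `reduce_by_key_host` and `sum_equal_segments` are made up for illustration; on the device, the same keys could presumably be generated lazily rather than stored.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical host model of a reduce-by-key primitive: consecutive elements
// with equal keys are summed, producing one (key, sum) pair per run.
std::vector<std::pair<std::size_t, double>>
reduce_by_key_host(const std::vector<std::size_t>& keys,
                   const std::vector<double>& values) {
    assert(keys.size() == values.size());
    std::vector<std::pair<std::size_t, double>> out;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (out.empty() || out.back().first != keys[i]) {
            out.emplace_back(keys[i], values[i]);  // key changed: start a new run
        } else {
            out.back().second += values[i];        // same key: extend the run
        }
    }
    return out;
}

// Hypothetical helper: sum equal-length segments by deriving the key of
// element i as i / segment_length (assumes a contiguous reduced axis).
std::vector<double> sum_equal_segments(const std::vector<double>& values,
                                       std::size_t segment_length) {
    assert(segment_length > 0 && values.size() % segment_length == 0);
    std::vector<std::size_t> keys(values.size());
    for (std::size_t i = 0; i < keys.size(); ++i) {
        keys[i] = i / segment_length;
    }
    std::vector<double> sums;
    for (const auto& kv : reduce_by_key_host(keys, values)) {
        sums.push_back(kv.second);
    }
    return sums;
}
```

For a 2x3 array stored row by row, `sum_equal_segments(data, 3)` yields one sum per row. The point of the sketch is only the key mapping; it says nothing about how well the GPU version would perform.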
Another option is to use the CUB/rocPRIM/hipCUB segmented reductions:
https://nvlabs.github.io/cub/structcub_1_1_device_segmented_reduce.html
https://codedocs.xyz/ROCmSoftwarePlatform/rocPRIM/group__devicemodule.html#ga02e109ce385d48b1fbde10d42900532d
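These segmented reduction interfaces describe segments with begin/end offsets into the input, so the equal-length case only needs offsets of the form i * segment_length. Below is a rough host-side model in plain C++ of that mapping. It is not the library call itself, and `make_equal_offsets` and `segmented_sum_host` are hypothetical names used only for this sketch. Segment s is assumed to cover the half-open range [offsets[s], offsets[s + 1]).

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: build num_segments + 1 offsets for equal-length
// segments, so that segment s spans [offsets[s], offsets[s + 1]).
std::vector<std::size_t> make_equal_offsets(std::size_t num_segments,
                                            std::size_t segment_length) {
    std::vector<std::size_t> offsets(num_segments + 1);
    for (std::size_t i = 0; i <= num_segments; ++i) {
        offsets[i] = i * segment_length;
    }
    return offsets;
}

// Hypothetical host reference for an offsets-based segmented sum. A device
// library would do the per-segment work in parallel; this loop only pins
// down the expected result.
std::vector<double> segmented_sum_host(const std::vector<double>& in,
                                       const std::vector<std::size_t>& offsets) {
    assert(!offsets.empty());
    std::vector<double> out(offsets.size() - 1);
    for (std::size_t s = 0; s + 1 < offsets.size(); ++s) {
        assert(offsets[s] <= offsets[s + 1] && offsets[s + 1] <= in.size());
        const auto first = in.begin() + static_cast<std::ptrdiff_t>(offsets[s]);
        const auto last = in.begin() + static_cast<std::ptrdiff_t>(offsets[s + 1]);
        out[s] = std::accumulate(first, last, 0.0);
    }
    return out;
}
```

A host reference like this could also double as a correctness check when comparing backends, since all of them should agree with it up to floating-point summation order.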
For SYCL, we can use the standard reduction API, but it will likely take more work to optimize well across different output/reduction length profiles (assuming CUB/rocPRIM even do such optimizations).