Reduction interface
@glwagner needs a reduction interface, so we should finally add one.
cc: @jpsamaroo
What kinds of reduction intrinsics does CUDA support? AMDGPU has wfred, which reduces a single value across all active lanes. I figure this could be easy to support with something like: result = @reduce_warp op input
Maybe something like what FoldsCUDA does?
Even if there is no portable access to lower-level reduction primitives, it would be good to have an example of a reduction operation (e.g. sum, maximum), probably implemented via local memory. The histogram example does that, but it's more complicated than a straightforward reduction.
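For reference, here is a minimal sketch of what such a local-memory example could look like. It is not an existing KernelAbstractions API: the names `GROUPSIZE`, `group_reduce!`, and `ka_reduce` are made up for illustration, and it assumes the 0.9-style interface (`get_backend`, `KernelAbstractions.synchronize`). To keep it short, one workitem per group folds the staged values serially instead of doing a tree reduction, and the per-group partial results are combined on the host.

```julia
using KernelAbstractions

# Hypothetical fixed workgroup size; it also sizes the local-memory buffer.
const GROUPSIZE = 256

# Each workgroup stages its values in local memory, then its first workitem
# folds them with `op` and writes one partial result per group.
@kernel function group_reduce!(partials, @Const(input), op, neutral)
    i  = @index(Global, Linear)
    li = @index(Local, Linear)
    gi = @index(Group, Linear)

    shared = @localmem eltype(partials) (GROUPSIZE,)

    # The launch pads ndrange to a multiple of GROUPSIZE, so fill the
    # out-of-range slots with the neutral element.
    @inbounds shared[li] = i <= length(input) ? input[i] : neutral
    @synchronize

    if li == 1
        acc = neutral
        for k in 1:GROUPSIZE
            @inbounds acc = op(acc, shared[k])
        end
        @inbounds partials[gi] = acc
    end
end

# Host-side wrapper (hypothetical): launch one kernel pass, then finish the
# reduction over the per-group partials on the host.
function ka_reduce(op, input; neutral = zero(eltype(input)))
    isempty(input) && return neutral
    backend = get_backend(input)
    ngroups = cld(length(input), GROUPSIZE)
    partials = similar(input, ngroups)

    kernel! = group_reduce!(backend, GROUPSIZE)
    kernel!(partials, input, op, neutral; ndrange = ngroups * GROUPSIZE)
    KernelAbstractions.synchronize(backend)

    return reduce(op, Array(partials); init = neutral)
end

x = rand(Float32, 10_000)
total   = ka_reduce(+, x)                      # compare against sum(x)
largest = ka_reduce(max, x; neutral = -Inf32)  # compare against maximum(x)
```

`neutral` has to be the identity element for `op` (zero for `+`, `-Inf` for `max`), since it pads the last group. The serial fold avoids putting `@synchronize` inside a loop; a tree reduction in local memory would do less work per workitem but needs a barrier at every level.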