
groupreduction and subgroupreduction

Open brabreda opened this issue 2 years ago • 10 comments

I am unsure why my previous PR was closed, but here are the changes.

  • I added docs
  • I added tests

It was my first time writing tests, and they passed. How are these tested on GPUs, and do I need more tests?

brabreda avatar Sep 09 '23 12:09 brabreda

Thank you! As you found, the notion of a subgroup is not necessarily consistent across backends. So I am wondering whether we need to expose that as a feature, or whether backends could use it internally to implement reductions.

So, as an example, could we "just" add `rval = @reduce(op, val)` to perform a workgroup-level reduction?

What is `rval` across all members of a workgroup? Is it the same, or may it diverge so that only the first workitem receives a correct answer?

Do we need a broadcast/shuffle operation?

I don't have the answers to those questions.

I think adding @reduce is definitely something we want to do, but subgroup ops may need more thought.

vchuravy avatar Sep 09 '23 19:09 vchuravy
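To make the question about `rval` concrete, here is a hypothetical usage sketch of the proposed API; `@reduce` is only the macro under discussion, not an existing KernelAbstractions export, and `partial_sums!` is an invented name:

```julia
using KernelAbstractions

@kernel function partial_sums!(out, @Const(x))
    i    = @index(Global, Linear)
    val  = i <= length(x) ? x[i] : zero(eltype(out))
    rval = @reduce(+, val)                 # proposed workgroup-level reduction of `val`
    if @index(Local, Linear) == 1          # per the discussion, only workitem 1 is guaranteed a valid `rval`
        out[@index(Group, Linear)] = rval
    end
end
```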

  • `rval = @reduce(op, val)` is possible. In this case we only use `localmem` and `@synchronize` for the reduction (a rough sketch follows below).

  • At the moment only the first workitem of the group returns the result, but I think it would be possible to change this.

  • I think providing a neutral value will give better performance, but I am not sure. I will check this.

I will fix this; meanwhile we can think about the warps: whether we introduce them at all and in what way.

brabreda avatar Sep 09 '23 21:09 brabreda
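For reference, a minimal sketch of what such a shared-memory group reduction could look like on a GPU backend, using only `@localmem` and `@synchronize`; the kernel name, the fixed workgroup size of 256, and the argument layout are assumptions, not the PR's actual code:

```julia
using KernelAbstractions

@kernel function groupreduce_kernel!(out, @Const(x), op, neutral)
    lid = @index(Local, Linear)
    gid = @index(Global, Linear)

    # Scratch space in local memory; the size must match the launch workgroupsize.
    shared = @localmem eltype(out) (256,)
    shared[lid] = gid <= length(x) ? x[gid] : neutral
    @synchronize

    # Tree reduction: halve the number of active workitems each step.
    s = 128
    while s >= 1
        if lid <= s
            shared[lid] = op(shared[lid], shared[lid + s])
        end
        @synchronize
        s >>= 1
    end

    # Only the first workitem of each group holds the final value.
    if lid == 1
        out[@index(Group, Linear)] = shared[1]
    end
end
```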

Yeah requiring a neutral value sounds good.

Couldn't each backend choose to use warp-level ops, if available, instead of having the user make this decision?

vchuravy avatar Sep 09 '23 21:09 vchuravy

The problem with that approach is that warp intrinsics don’t work for every type.

We then would need something like this:

`__reduce(op, val::Union{type1, …})` for warps
`__reduce(op, val)` for everything else

In this case you don't override the function. I am also not sure whether this kind of type specialisation works, i.e. whether dispatch picks the specialised method when the value is one of those types. You probably have more expertise on this.

Would it be preferred that every thread returns the right `rval`? I do think in that case I will have to rethink the warp-reduction-based groupreduce.

brabreda avatar Sep 10 '23 01:09 brabreda
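A quick way to settle the dispatch question: Julia picks the most specific applicable method, so a `Union`-typed method wins over the generic fallback whenever the value's type is in the `Union`. A toy check (the `__reduce` bodies are stand-ins, not real reductions):

```julia
__reduce(op, val, neutral) = (:generic, op(val, neutral))
__reduce(op, val::Union{Float32, Float64}, neutral) = (:warp, op(val, neutral))

__reduce(+, 1, 0)          # -> (:generic, 1)      Int falls through to the fallback
__reduce(+, 1.0f0, 0.0f0)  # -> (:warp, 1.0f0)     Float32 hits the specialised method
```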

> Would it be preferred that every thread returns the right `rval`? I do think in that case I will have to rethink the warp-reduction-based groupreduce.

No, it just needs to be documented :) and we might want to add a `@broadcast`.

So it looks like CUDA.jl makes this decision based on the element type? https://github.com/JuliaGPU/CUDA.jl/blob/d57e02018ccf00ceda7672f5f4d98e2ceef9106d/src/mapreduce.jl#L177-L181

vchuravy avatar Sep 10 '23 02:09 vchuravy

Yeah, CUDA.jl (and I think most GPU backends) makes the decision based on type. I was wondering if something like this would be possible.

For the implementation without warps:

```julia
@reduce(op, val, neutral)
__reduce(op, val, neutral)
```

And every backend, like CUDAKernels, just implements something like this:

```julia
__reduce(op, val::T, neutral) where {T <: Union{Float32, Float64, etc...}}
```

It would allow for one common implementation and a few specializations done by the backends themselves. I tried it today, but I get a lot of dynamic invocation errors, and I am not sure what's at fault.

brabreda avatar Sep 10 '23 22:09 brabreda
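A sketch of how that split could look across package boundaries, with the generic path owned by KernelAbstractions and a backend adding a method for the warp-friendly types; module names, the type list, and the trivial bodies are all placeholders:

```julia
module ReduceSketch                        # stands in for KernelAbstractions
    __reduce(op, val, neutral) = (:shared_mem, op(val, neutral))   # generic localmem path (stub)
end

module CUDAKernelsSketch                   # stands in for a backend such as CUDAKernels
    import ..ReduceSketch: __reduce
    # Warp-shuffle path, only for element types the intrinsics support (stub).
    __reduce(op, val::T, neutral) where {T <: Union{Float32, Float64, Int32, Int64}} =
        (:warp, op(val, neutral))
end

ReduceSketch.__reduce(+, 1 + 2im, 0 + 0im)  # -> (:shared_mem, 1 + 2im)  Complex uses the generic path
ReduceSketch.__reduce(+, 1.0f0, 0.0f0)      # -> (:warp, 1.0f0)          Float32 uses the backend method
```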

I recommend using Cthulhu: `using CUDA` and then `CUDA.@device_code_typed interactive=true`.

vchuravy avatar Sep 11 '23 21:09 vchuravy
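A possible shape of that workflow, assuming CUDA.jl's `@device_code_typed` macro (which drops into Cthulhu when given `interactive=true`); the `saxpy!` kernel here is only a placeholder so there is something to inspect:

```julia
using CUDA, Cthulhu

function saxpy!(y, a, x)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(y)
        @inbounds y[i] += a * x[i]
    end
    return
end

y, x = CUDA.ones(1024), CUDA.ones(1024)
# Interactively descend into the typed device code to hunt for dynamic invocations:
CUDA.@device_code_typed interactive=true @cuda threads=256 blocks=4 saxpy!(y, 2f0, x)
```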

Only the group reduction without subgroups remains. It is ready to merge; once merged, I will add support for subgroups to CUDAKernels, MetalKernels, and oneAPI.

brabreda avatar Sep 14 '23 20:09 brabreda

Okay, so this is API-neutral? A pure addition, and we don't need anything new? Great! Otherwise we could have wrapped it in with https://github.com/JuliaGPU/KernelAbstractions.jl/pull/422

vchuravy avatar Sep 19 '23 14:09 vchuravy

The API is indeed neutral. The `@synchronize` is a problem we can't get around. I was thinking about adding a value to `__ctx__` or something similar to indicate whether we are compiling for GPU or CPU; based on that value we could create a different implementation for CPUs by using `Val()`. If we go that route we could also add something like a device version, which would allow the use of newer operations such as the warp reduce operation that does not use shuffle instructions (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities).

brabreda avatar Sep 20 '23 11:09 brabreda
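A very rough, purely hypothetical sketch of that idea: a flag carried in (or derived from) `__ctx__` selects the implementation by dispatching on `Val`; none of these names exist in KernelAbstractions, and the bodies are stand-ins:

```julia
# A real version would do the localmem/warp reduction (GPU) or a sequential
# per-group reduction (CPU) instead of these stubs.
__groupreduce(op, val, neutral, ::Val{:GPU}) = (:gpu_path, op(val, neutral))
__groupreduce(op, val, neutral, ::Val{:CPU}) = (:cpu_path, op(val, neutral))

device_flag(ctx) = Val(:CPU)   # hypothetical query of the kernel context for the compile target

__groupreduce(+, 1.0, 0.0, device_flag(nothing))  # -> (:cpu_path, 1.0)
```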