KernelAbstractions.jl
groupreduction and subgroupreduction
I am unsure why my previous PR was closed, but here are the changes.
- I added docs
- I added tests
It was my first time writing tests, and they passed. How are these tested on GPUs, and do I need more tests?
Thank you! As you found, the notion of a subgroup is not necessarily consistent across backends. So I am wondering whether we need to expose that as a feature, or whether backends could use them internally to implement the reduction.
So, as an example, could we "just" add `rval = @reduce(op, val)` to perform a workgroup-level reduction?
What is `rval` across all members of a workgroup? Is it the same, or may it diverge so that only the first workitem receives a correct answer?
Do we need a broadcast/shuffle operation?
I don't have the answers to those questions.
I think adding `@reduce` is definitely something we want to do, but subgroup ops may need more thought.
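For reference, here is a hedged sketch of what the call site could look like inside a kernel. `@reduce` does not exist in KernelAbstractions yet; the kernel name is made up, and the "only the first workitem holds the result" behaviour is exactly the open question above.

```julia
using KernelAbstractions

# Hypothetical usage of the *proposed* @reduce macro (not existing API):
# each workgroup reduces its values and writes one partial result.
@kernel function partial_sum!(out, a)
    i = @index(Global)
    rval = @reduce(+, a[i])          # proposed workgroup-level reduction
    if @index(Local) == 1            # assuming only the first workitem holds rval
        out[@index(Group)] = rval
    end
end
```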
- `rval = @reduce(op, val)` is possible. In this case we only use `localmem` and `@synchronize` for the reduction (sketched below).
- At the moment only the first thread of the group returns the result, but I think it would be possible to change this.
- I think providing a neutral value will give better performance, but I am not sure. I will check this.
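A minimal sketch of that `localmem` + `@synchronize` approach, written as a plain kernel with a fixed workgroup size of 64 (a power of two); the kernel name and the launch line are illustrative only.

```julia
using KernelAbstractions

@kernel function groupsum!(out, a)
    gi = @index(Global)
    li = @index(Local)
    shared = @localmem eltype(a) (64,)   # one slot per workitem

    shared[li] = a[gi]
    @synchronize

    # Tree reduction in local memory; the loop is uniform across the workgroup.
    for s in (32, 16, 8, 4, 2, 1)
        if li <= s
            shared[li] += shared[li + s]
        end
        @synchronize
    end

    # Only the first workitem holds the final value.
    if li == 1
        out[@index(Group)] = shared[1]
    end
end

# Example launch (CPU backend shown for illustration; substitute a GPU backend):
# a = rand(Float32, 256); out = zeros(Float32, 4)
# groupsum!(CPU(), 64)(out, a, ndrange = length(a))
```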
I will fix this. Meanwhile, we can think about warps: whether we introduce them, and in what way.
Yeah requiring a neutral value sounds good.
Couldn't each backend choose to use warp-level ops, if available, instead of having the user make this decision?
The problem with that approach is that warp intrinsics don’t work for every type.
We would then need something like this:
- `__reduce(op, val::Union{type1, …})` for warps
- `__reduce(op, val)` for everything else

In this case you don't override the function. I am also not sure whether this kind of type specialization works, i.e. whether dispatch picks the first (Union) method when the value is one of those types. You probably have more expertise on this.
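For what it's worth, plain Julia dispatch already behaves the way you'd want here: the more specific `Union` method wins when the type matches, and everything else falls through to the generic one. A toy check (the `__reduce` below is just a stand-in, not the real function):

```julia
# Stand-in definitions, no GPU involved; op(val, val) is a dummy body.
__reduce(op, val::Union{Float32, Float64}) = (:warp_path, op(val, val))
__reduce(op, val)                          = (:generic_path, op(val, val))

__reduce(+, 1.0f0)      # (:warp_path, 2.0f0)   -- hits the Union method
__reduce(+, Int32(1))   # (:generic_path, 2)    -- falls back to the generic method
```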
Would it be preferred that every thread returns the right `rval`? I do think in that case I will have to rethink the warp-reduction-based groupreduce.
No, it just needs to be documented :) and we might want to add a `@broadcast`.
So it looks like CUDA.jl makes this decision based on the element type? https://github.com/JuliaGPU/CUDA.jl/blob/d57e02018ccf00ceda7672f5f4d98e2ceef9106d/src/mapreduce.jl#L177-L181
Yeah, CUDA (and I think most GPU backends) makes the decision based on type. I was wondering whether something like this would be possible:
For the implementation without warps:
- `@reduce(op, val, neutral)`
- `__reduce(op, val, neutral)`

And every backend like CUDAKernels just implements something like this:
- `__reduce(op, val::T, neutral) where {T <: Union{Float32, Float64, etc...}}`
It would allow for one common implementation and a few specializations done by the backends themselves. However, when I tried it today I got a lot of dynamic invocation errors, and I am not sure what's at fault.
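A minimal sketch of that split, assuming the generic fallback lives in KernelAbstractions and a backend package extends it with a narrower method; the module names, method bodies, and the two-type `Union` are illustrative only (the real warp path would use shuffle intrinsics and cover more types):

```julia
module SketchKernelAbstractions
    # Generic fallback: in the real package this would be the
    # @localmem + @synchronize implementation.
    __reduce(op, val, neutral) = op(neutral, val)
end

module SketchCUDAKernels
    # A backend extends the same function with a more specific method.
    import ..SketchKernelAbstractions: __reduce
    __reduce(op, val::T, neutral) where {T <: Union{Float32, Float64}} = op(neutral, val)
end

# Dispatch picks the backend method for shuffle-capable element types:
SketchKernelAbstractions.__reduce(+, 1.0f0, 0.0f0)        # backend (warp) method
SketchKernelAbstractions.__reduce(+, Int16(1), Int16(0))  # generic fallback
```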
I recommend using Cthulhu: `using CUDA` and then `CUDA.@device_code_typed interactive=true`.
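Concretely, that workflow could look something like this; the kernel and its arguments are placeholders for the failing reduction kernel. `interactive=true` drops you into Cthulhu on the generated device code, where the dynamic invocations typically show up as untyped (`::Any`) calls.

```julia
using CUDA, Cthulhu

# Placeholder launch; replace with the actual reduction kernel call.
CUDA.@device_code_typed interactive=true @cuda threads=256 my_kernel(out, a)
```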
Only the group reduction without subgroups remains. It is ready to merge; once merged, I will add support for subgroups to CUDAKernels, MetalKernels, and oneAPI.
Okay, so this is API-neutral? A pure addition, and we don't need anything new? Great! Otherwise we could have folded it in with https://github.com/JuliaGPU/KernelAbstractions.jl/pull/422
The API is indeed neutral. The `@synchronize` is a problem that we can't get around. I was thinking about adding a value to `__ctx__`, or something similar, to indicate whether we are compiling for GPU or CPU; based on that value we could create a different implementation for CPUs by using `Val()`. If we go that route, we could also add something like a device version, which would allow the use of newer operations such as the warp reduce operation that does not use shuffle instructions (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities).
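As a rough illustration of that idea, assuming the context could carry a compile-target flag; none of these names (`reduce_impl`, `Val(:CPU)`, `Val(:GPU)`, `Val(:sm_80)`) exist in KernelAbstractions today, and the method bodies are placeholders.

```julia
# Hypothetical: dispatch the reduction body on a target flag pulled from the
# kernel context. The GPU path would use @localmem/@synchronize (or warp
# intrinsics); the CPU path could be a plain sequential fold.
reduce_impl(op, val, neutral, ::Val{:CPU}) = op(neutral, val)
reduce_impl(op, val, neutral, ::Val{:GPU}) = op(neutral, val)

# A device-version flag could be threaded through the same way, e.g. to pick
# the newer warp reduce instructions on recent compute capabilities (purely
# illustrative):
reduce_impl(op, val, neutral, ::Val{:GPU}, ::Val{:sm_80}) = op(neutral, val)

reduce_impl(+, 1.0f0, 0.0f0, Val(:CPU))   # selects the CPU method
```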