Simple Block Reduce Fails when using `while` loops
Hi, thank you for developing this library - I would like to write optimised kernels for common GPU algorithms such as reduce, scan, radix sort, etc. similar to CUB but available on all KernelAbstractions platforms. The resulting "KA standard library" (KALib? Caleb?) could be used as a benchmark for future KA development & optimisation - and I can use the lessons along the way to populate the "Writing Kernels" section in the documentation. Big plans, but...
I'm implementing the block-wise reduce following this tutorial with this simple-looking code:
using KernelAbstractions
using CUDA
using CUDAKernels
@kernel function block_reduce(out, in)
    # Get block / workgroup size
    bs = @uniform @groupsize()[1]

    # Block / group index, thread index within block, global thread index
    bi = @index(Group, Linear)
    ti = @index(Local, Linear)
    gi = @index(Global, Linear)

    # Copy each thread's corresponding item from global to shared memory
    cache = @localmem eltype(out) (bs,)
    cache[ti] = in[gi]
    @synchronize

    # Reduce elements in shared memory using sequential addressing following
    # https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
    @private s = bs ÷ 2
    while s > 0
        if ti < s
            cache[ti] += cache[ti + s]
        end
        @synchronize
        s = s >> 1
    end

    # Copy result back to global memory
    if ti == 1
        out[bi] = cache[1]
    end
end
num_blocks = 10
block_size = 32
num_elements = num_blocks * block_size
in = rand(1:10, num_elements) |> CuArray
out = zeros(num_blocks) |> CuArray
kernel_reduce = block_reduce(CUDADevice(), block_size)
ev = kernel_reduce(out, in, ndrange=num_elements)
wait(ev)
println(out)
It shouldn't be more exotic than the example code in the docs - however, these two lines:
@private s = bs ÷ 2
while s > 0
Produce the following errors:
Reason: unsupported use of an undefined name (use of 'bs')
Stacktrace:
[1] macro expansion
@ ~/Prog/Julia/KALib/prototype/reduce.jl:31
[2] gpu_block_reduce
@ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
[3] gpu_block_reduce
@ ./none:0
Reason: unsupported dynamic function invocation (call to div)
Stacktrace:
[1] macro expansion
@ ~/Prog/Julia/KALib/prototype/reduce.jl:31
[2] gpu_block_reduce
@ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
[3] gpu_block_reduce
@ ./none:0
Reason: unsupported use of an undefined name (use of 's')
Stacktrace:
[1] macro expansion
@ ~/Prog/Julia/KALib/prototype/reduce.jl:32
[2] gpu_block_reduce
@ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
[3] gpu_block_reduce
@ ./none:0
Reason: unsupported dynamic function invocation (call to >)
[...Stacktrace...]
I tried stepping through the code with Cthulhu.jl, but the errors look straightforward: the kernel ends up calling div(::Any, ::Int64) and >(::Any, ::Int64), so I assume bs = @uniform @groupsize()[1] and @private s = bs ÷ 2 are not being inferred as integers.
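One way to test that hypothesis would be to drop the mutable @private counter and recompute the stride from a plain loop counter each iteration, so that no mutable state crosses the @synchronize points. A rough, untested sketch, assuming the block size is a power of two:

# Replacement for the @private/while part of the kernel above
for p in 1:trailing_zeros(bs)          # log2(bs) halving steps
    s = bs >> p                        # stride: bs ÷ 2, bs ÷ 4, ..., 1
    if ti <= s                         # threads 1..s do the adds (1-based indexing)
        cache[ti] += cache[ti + s]
    end
    @synchronize
end

I haven't verified whether this lowers correctly on either backend, though.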
If I switch the arrays and device to CPU() I get the following error:
nested task error: MethodError: no method matching isless(::Int64, ::NTuple{32, Int64})
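For reference, by "switch to CPU" I mean the same kernel with only the driver changed, roughly:

in = rand(1:10, num_elements)              # plain Array instead of CuArray
out = zeros(num_blocks)
kernel_reduce = block_reduce(CPU(), block_size)
ev = kernel_reduce(out, in, ndrange=num_elements)
wait(ev)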
Would you know why these errors appear or how I could investigate (and fix..) them?
Thanks, Leonard
Hm I need to think through the semantics of while loops on the CPU... #262
You can use @macroexpand to debug the lowering of the kernel. You should see two functions being generated: one for the GPU, the other for the CPU.
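For example, on a throwaway kernel (Base.remove_linenums! just strips the line-number comments to make the output readable):

using KernelAbstractions
ex = @macroexpand @kernel function toy(a)
    i = @index(Global, Linear)
    a[i] = 2 * a[i]
end
println(Base.remove_linenums!(ex))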
This looks like a scope issue.
In general I was thinking that instead of implementing block reduce in KA we would provide a reduce function or macro that the backends need to implement.
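Very roughly, something like this; the @groupreduce name and shape are purely illustrative, nothing like it exists in KA today:

@kernel function block_reduce(out, in)
    bi = @index(Group, Linear)
    ti = @index(Local, Linear)
    gi = @index(Global, Linear)
    # Hypothetical group-level primitive that each backend would implement natively,
    # e.g. via warp shuffles / CUB on CUDA and a plain loop on the CPU.
    val = @groupreduce(+, in[gi])
    if ti == 1
        out[bi] = val
    end
end

That way the tricky synchronization lives inside the backend instead of in user code.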
On the CPU the while loop misses an index into the thread-local @private variable:
while s > 0
    begin
        var"##N#486" = length((KernelAbstractions.__workitems_iterspace)(__ctx__))
        begin
            #= /home/andreinicusan/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:264 =#
            for var"##I#485" = (KernelAbstractions.__workitems_iterspace)(__ctx__)
                #= /home/andreinicusan/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:265 =#
                (KernelAbstractions.__validindex)(__ctx__, var"##I#485") || continue
                #= /home/andreinicusan/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:266 =#
                bi = KernelAbstractions.__index_Group_Linear(__ctx__, var"##I#485")
                ti = KernelAbstractions.__index_Local_Linear(__ctx__, var"##I#485")
                gi = KernelAbstractions.__index_Global_Linear(__ctx__, var"##I#485")
                #= /home/andreinicusan/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:267 =#
                if ti < s[(KernelAbstractions.__index_Local_Linear)(__ctx__, var"##I#485")]
                    #= REPL[18]:21 =#
                    cache[ti] += cache[ti + s[(KernelAbstractions.__index_Local_Linear)(__ctx__, var"##I#485")]]
                end
                #= /home/andreinicusan/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:268 =#
            end
        end
    end
I notice that the @private s is normally accessed as s[(KernelAbstractions.__index_Local_Linear)(__ctx__, var"##I#485")] (the NTuple simulation of the GPU semantics is really clever!), but the while s > 0 condition doesn't index it.
Does the length((KernelAbstractions.__workitems_iterspace)(__ctx__)) need to be redefined so many times? Could it be defined only once outside the user-code loops?
Also, would a break then work correctly in a @kernel?
On the GPU I see:
[...]
bs = ((#= /home/andreinicusan/.julia/packages/KernelAbstractions/DqITC/src/KernelAbstractions.jl:127 =#
(KernelAbstractions.groupsize)(__ctx__)))[1]
[...]
var"#177#s" = KernelAbstractions.bs ÷ 2
while s > 0
[...]
Why is bs the only variable accessed as KernelAbstractions.bs rather than as the locally defined bs? That would explain the undefined name and the unsupported dynamic invocation of div.
The same goes for var"#177#s" versus s. I find the macros.jl source a bit hard to follow (I need to wrap my head around macros better), but would you know where to look next to fix these transformations?
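From what I understand of Julia macro hygiene, a variable that is only read in a macro's output (and not escaped) is resolved as a global in the macro's home module, while a variable that is assigned gets renamed to a fresh local. A toy example, unrelated to KA's actual code:

module M
macro hygiene_demo()
    # x is assigned, so hygiene renames it to a gensym'd local;
    # y is only read, so it resolves as a global in the macro's module, i.e. M.y.
    return :(x = y + 1)
end
end

@macroexpand M.@hygiene_demo()    # something like :(var"#3#x" = M.y + 1)

So my guess is that the transform re-emits bs in a fragment where it is only read and not escaped (hence KernelAbstractions.bs), and s in a fragment where it is assigned (hence the gensym'd var"#177#s").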
Thanks for the quick reply - I'll be using @macroexpand in the future! This could go into a "Debugging Kernels" section in the docs...
So one thing to think through is what a while loop with a @synchronize inside should look like.
Yeah the @synchronize makes while loops hard....
# One private stride per work-item, plus a mask tracking which work-items are still looping
s = MVector{length(wkgrp), Int}(undef)
s .= bs ÷ 2
mask = map(x -> x > 0, s)
while any(mask)
    # First chunk of the loop body, up to the @synchronize
    for tid in wkgrp
        mask[tid] || continue
        if tid < s[tid]
            cache[tid] += cache[tid + s[tid]]
        end
    end
    # Second chunk of the loop body, after the @synchronize
    for tid in wkgrp
        mask[tid] || continue
        s[tid] = s[tid] >> 1
    end
    # Re-evaluate the loop condition per work-item
    mask = map(x -> x > 0, s)
end
But we certainly couldn't support break
We could solve break by introducing a mask... I like this direction, but it is something that the current architecture doesn't easily support.
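Sketching that on top of the mask loop above (illustrative only; the break condition is a made-up placeholder):

for tid in wkgrp
    mask[tid] || continue
    if done(tid)               # hypothetical stand-in for whatever condition triggers the break
        mask[tid] = false      # per-work-item break: skip every remaining iteration
        continue
    end
    # ... rest of the loop body ...
end
# the surrounding while any(mask) then terminates once every work-item has broken out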