Simple Block Reduce Fails when using `while` loops
Hi, thank you for developing this library - I would like to write optimised kernels for common GPU algorithms such as reduce, scan, radix sort, etc. similar to CUB but available on all KernelAbstractions platforms. The resulting "KA standard library" (KALib? Caleb?) could be used as a benchmark for future KA development & optimisation - and I can use the lessons along the way to populate the "Writing Kernels" section in the documentation. Big plans, but...
I'm implementing the block-wise reduce following this tutorial with this simple-looking code:
using KernelAbstractions
using CUDA
using CUDAKernels
@kernel function block_reduce(out, in)
    # Get block / workgroup size
    bs = @uniform @groupsize()[1]

    # Block / group index, thread index within block, global thread index
    bi = @index(Group, Linear)
    ti = @index(Local, Linear)
    gi = @index(Global, Linear)

    # Copy each thread's corresponding item from global to shared memory
    cache = @localmem eltype(out) (bs,)
    cache[ti] = in[gi]
    @synchronize

    # Reduce elements in shared memory using sequential addressing following
    # https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
    @private s = bs ÷ 2
    while s > 0
        if ti < s
            cache[ti] += cache[ti + s]
        end
        @synchronize
        s = s >> 1
    end

    # Copy result back to global memory
    if ti == 1
        out[bi] = cache[1]
    end
end
num_blocks = 10
block_size = 32
num_elements = num_blocks * block_size
in = rand(1:10, num_elements) |> CuArray
out = zeros(num_blocks) |> CuArray
kernel_reduce = block_reduce(CUDADevice(), block_size)
ev = kernel_reduce(out, in, ndrange=num_elements)
wait(ev)
println(out)
It shouldn't be more exotic than the example code in the docs - however, these two lines:
@private s = bs ÷ 2
while s > 0
Produce the following errors:
Reason: unsupported use of an undefined name (use of 'bs')
Stacktrace:
[1] macro expansion
@ ~/Prog/Julia/KALib/prototype/reduce.jl:31
[2] gpu_block_reduce
@ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
[3] gpu_block_reduce
@ ./none:0
Reason: unsupported dynamic function invocation (call to div)
Stacktrace:
[1] macro expansion
@ ~/Prog/Julia/KALib/prototype/reduce.jl:31
[2] gpu_block_reduce
@ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
[3] gpu_block_reduce
@ ./none:0
Reason: unsupported use of an undefined name (use of 's')
Stacktrace:
[1] macro expansion
@ ~/Prog/Julia/KALib/prototype/reduce.jl:32
[2] gpu_block_reduce
@ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
[3] gpu_block_reduce
@ ./none:0
Reason: unsupported dynamic function invocation (call to >)
[...Stacktrace...]
I tried stepping through the code with Cthulhu.jl, but the errors look straightforward: the kernel ends up calling div(::Any, ::Int64) and >(::Any, ::Int64), so I assume bs = @uniform @groupsize()[1] and @private s = bs ÷ 2 are not being inferred as integers.
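One way to test that hypothesis would be to drop the mutable @private counter and recompute the stride from a plain loop counter each iteration, so that no mutable state crosses the @synchronize points. A rough, untested sketch, assuming the block size is a power of two:

# Replacement for the @private/while part of the kernel above
for p in 1:trailing_zeros(bs)          # log2(bs) halving steps
    s = bs >> p                        # stride: bs ÷ 2, bs ÷ 4, ..., 1
    if ti <= s                         # threads 1..s do the adds (1-based indexing)
        cache[ti] += cache[ti + s]
    end
    @synchronize
end

I haven't verified whether this lowers correctly on either backend, though.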
If I switch the arrays and device to CPU() I get the following error:
nested task error: MethodError: no method matching isless(::Int64, ::NTuple{32, Int64})
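For reference, by "switch to CPU" I mean the same kernel with only the driver changed, roughly:

in = rand(1:10, num_elements)              # plain Array instead of CuArray
out = zeros(num_blocks)
kernel_reduce = block_reduce(CPU(), block_size)
ev = kernel_reduce(out, in, ndrange=num_elements)
wait(ev)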
Would you know why these errors appear or how I could investigate (and fix..) them?
Thanks, Leonard
Hm I need to think through the semantics of while loops on the CPU... #262
You can use @macroexpand to debug the lowering of the kernel. You should see two functions being generated: one for the GPU, the other for the CPU.
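For example, on a throwaway kernel (Base.remove_linenums! just strips the line-number comments to make the output readable):

using KernelAbstractions
ex = @macroexpand @kernel function toy(a)
    i = @index(Global, Linear)
    a[i] = 2 * a[i]
end
println(Base.remove_linenums!(ex))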
This looks like a scope issue.
In general I was thinking that instead of implementing block reduce in KA we would provide a reduce function or macro that the backends need to implement.
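Very roughly, something like this; the @groupreduce name and shape are purely illustrative, nothing like it exists in KA today:

@kernel function block_reduce(out, in)
    bi = @index(Group, Linear)
    ti = @index(Local, Linear)
    gi = @index(Global, Linear)
    # Hypothetical group-level primitive that each backend would implement natively,
    # e.g. via warp shuffles / CUB on CUDA and a plain loop on the CPU.
    val = @groupreduce(+, in[gi])
    if ti == 1
        out[bi] = val
    end
end

That way the tricky synchronization lives inside the backend instead of in user code.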
On the CPU the while loop misses an index into the thread-local @private variable:
while s > 0
    begin
        var"##N#486" = length((KernelAbstractions.__workitems_iterspace)(__ctx__))
        begin
            #= /home/andreinicusan/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:264 =#
            for var"##I#485" = (KernelAbstractions.__workitems_iterspace)(__ctx__)
                #= /home/andreinicusan/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:265 =#
                (KernelAbstractions.__validindex)(__ctx__, var"##I#485") || continue
                #= /home/andreinicusan/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:266 =#
                bi = KernelAbstractions.__index_Group_Linear(__ctx__, var"##I#485")
                ti = KernelAbstractions.__index_Local_Linear(__ctx__, var"##I#485")
                gi = KernelAbstractions.__index_Global_Linear(__ctx__, var"##I#485")
                #= /home/andreinicusan/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:267 =#
                if ti < s[(KernelAbstractions.__index_Local_Linear)(__ctx__, var"##I#485")]
                    #= REPL[18]:21 =#
                    cache[ti] += cache[ti + s[(KernelAbstractions.__index_Local_Linear)(__ctx__, var"##I#485")]]
                end
                #= /home/andreinicusan/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:268 =#
            end
        end
    end
I notice that the @private s is normally accessed as s[(KernelAbstractions.__index_Local_Linear)(__ctx__, var"##I#485")] (the NTuple simulation of the GPU semantics is really clever!), but the while s > 0 condition doesn't index it.
Does the length((KernelAbstractions.__workitems_iterspace)(__ctx__)) need to be redefined so many times? Could it be defined only once outside the user-code loops?
Also, would a break then work correctly in a @kernel?
On the GPU I see:
[...]
bs = ((#= /home/andreinicusan/.julia/packages/KernelAbstractions/DqITC/src/KernelAbstractions.jl:127 =#
(KernelAbstractions.groupsize)(__ctx__)))[1]
[...]
var"#177#s" = KernelAbstractions.bs ÷ 2
while s > 0
[...]
Why is bs the only variable accessed as KernelAbstractions.bs rather than as the locally defined bs? That would explain the undefined name and the unsupported dynamic invocation of div.
The same goes for var"#177#s" versus s. I find the macros.jl source a bit hard to follow (I need to wrap my head around macros better), but would you know where to look next to fix these transformations?
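From what I understand of Julia macro hygiene, a variable that is only read in a macro's output (and not escaped) is resolved as a global in the macro's home module, while a variable that is assigned gets renamed to a fresh local. A toy example, unrelated to KA's actual code:

module M
macro hygiene_demo()
    # x is assigned, so hygiene renames it to a gensym'd local;
    # y is only read, so it resolves as a global in the macro's module, i.e. M.y.
    return :(x = y + 1)
end
end

@macroexpand M.@hygiene_demo()    # something like :(var"#3#x" = M.y + 1)

So my guess is that the transform re-emits bs in a fragment where it is only read and not escaped (hence KernelAbstractions.bs), and s in a fragment where it is assigned (hence the gensym'd var"#177#s").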
Thanks for the quick reply - I'll be using @macroexpand in the future! This could go into a "Debugging Kernels" section in the docs...
So one thing to think through is what a while loop with a @synchronize inside should look like.
Yeah the @synchronize makes while loops hard....
# One private stride per work-item, plus a mask tracking which work-items are still looping
s = MVector{length(wkgrp), Int}(undef)
s .= bs ÷ 2
mask = map(x -> x > 0, s)
while any(mask)
    # First chunk of the loop body, up to the @synchronize
    for tid in wkgrp
        mask[tid] || continue
        if tid < s[tid]
            cache[tid] += cache[tid + s[tid]]
        end
    end
    # Second chunk of the loop body, after the @synchronize
    for tid in wkgrp
        mask[tid] || continue
        s[tid] = s[tid] >> 1
    end
    # Re-evaluate the loop condition per work-item
    mask = map(x -> x > 0, s)
end
But we certainly couldn't support break
We could solve break by introducing a mask... I like this direction, but it is something that the current architecture doesn't easily support.
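Sketching that on top of the mask loop above (illustrative only; the break condition is a made-up placeholder):

for tid in wkgrp
    mask[tid] || continue
    if done(tid)               # hypothetical stand-in for whatever condition triggers the break
        mask[tid] = false      # per-work-item break: skip every remaining iteration
        continue
    end
    # ... rest of the loop body ...
end
# the surrounding while any(mask) then terminates once every work-item has broken out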