Synchronization in a nested while loop causes loss of threadID

Open • leios opened this issue 4 years ago • 1 comment

I ran across a method that asked me to use a shared memory pool of size blocksize and pull from it a few times in a for/while loop. I found that the CPU backend had problems with this. Here is an MWE (without the shmem shenanigans):

using Test
using CUDA
using CUDAKernels
using KernelAbstractions

@kernel function f_test_kernel!(a)
    tid = @index(Global, Linear)

    @uniform N = length(a)

    @uniform b = 0
    for i = 1:10
        if tid < N
            b += 1
            @synchronize()
        end
    end
end

a = zeros(1024)

# works on the GPU backend
wait(f_test_kernel!(CUDADevice(),256)(CuArray(a), ndrange=1024))

# fails on the CPU backend (UndefVarError: tid not defined)
wait(f_test_kernel!(CPU(),4)(a, ndrange=1024))

Note: without the if statement, everything works fine. I also tried a few different nested if statements to see if a similar error occurred, but could not replicate it. It seems to be specifically a loop after a conditional (although maybe a loop in a loop would also trigger it? Still digging).

Error message (tid not defined):

ERROR: LoadError: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:322 [inlined]
 [2] wait
   @ ~/projects/KernelAbstractions.jl/src/cpu.jl:65 [inlined]
 [3] wait (repeats 2 times)
   @ ~/projects/KernelAbstractions.jl/src/cpu.jl:29 [inlined]
 [4] top-level scope
   @ ~/projects/simuleios/histograms/mwe4.jl:22
 [5] include(fname::String)
   @ Base.MainInclude ./client.jl:444
 [6] top-level scope
   @ REPL[7]:1
 [7] top-level scope
   @ ~/.julia/packages/CUDA/YpW0k/src/initialization.jl:52

    nested task error: UndefVarError: tid not defined
    Stacktrace:
     [1] cpu_f_test_kernel!(::KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.NoDynamicCheck, CartesianIndex{1}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(4,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, ::Vector{Float64})
       @ ./none:0 [inlined]
     [2] overdub
       @ ./none:0 [inlined]
     [3] __thread_run(tid::Int64, len::Int64, rem::Int64, obj::KernelAbstractions.Kernel{CPU, KernelAbstractions.NDIteration.StaticSize{(4,)}, KernelAbstractions.NDIteration.DynamicSize, typeof(cpu_f_test_kernel!)}, ndrange::Tuple{Int64}, iterspace::KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(4,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}, args::Tuple{Vector{Float64}}, dynamic::KernelAbstractions.NDIteration.NoDynamicCheck)
       @ KernelAbstractions ~/projects/KernelAbstractions.jl/src/cpu.jl:157
     [4] __run(obj::KernelAbstractions.Kernel{CPU, KernelAbstractions.NDIteration.StaticSize{(4,)}, KernelAbstractions.NDIteration.DynamicSize, typeof(cpu_f_test_kernel!)}, ndrange::Tuple{Int64}, iterspace::KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(4,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}, args::Tuple{Vector{Float64}}, dynamic::KernelAbstractions.NDIteration.NoDynamicCheck)
       @ KernelAbstractions ~/projects/KernelAbstractions.jl/src/cpu.jl:130
     [5] (::KernelAbstractions.var"#33#34"{Nothing, Nothing, typeof(KernelAbstractions.__run), Tuple{KernelAbstractions.Kernel{CPU, KernelAbstractions.NDIteration.StaticSize{(4,)}, KernelAbstractions.NDIteration.DynamicSize, typeof(cpu_f_test_kernel!)}, Tuple{Int64}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(4,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}, Tuple{Vector{Float64}}, KernelAbstractions.NDIteration.NoDynamicCheck}})()
       @ KernelAbstractions ~/projects/KernelAbstractions.jl/src/cpu.jl:22
in expression starting at /home/leios/projects/simuleios/histograms/mwe4.jl:22

I'll try my hand at it if I cannot find a workaround, but I figured I would create an issue here first.

Nov 01 '21 16:11 leios

The generated CPU kernel is:

    function cpu_f_test_kernel!(__ctx__, a; )
        let
            $(Expr(:aliasscope))
            begin
                N = length(input)
                a = 0
                var"##N#273" = length((KernelAbstractions.__workitems_iterspace)(__ctx__))
            end
            if tid < N
                for i = 1:10
                    begin
                        var"##N#275" = length((KernelAbstractions.__workitems_iterspace)(__ctx__))
                        begin
                            #= /home/vchuravy/.julia/packages/KernelAbstractions/8W8KX/src/macros.jl:263 =#
                            for var"##I#274" = (KernelAbstractions.__workitems_iterspace)(__ctx__)
                                #= /home/vchuravy/.julia/packages/KernelAbstractions/8W8KX/src/macros.jl:264 =#
                                (KernelAbstractions.__validindex)(__ctx__, var"##I#274") || continue
                                #= /home/vchuravy/.julia/packages/KernelAbstractions/8W8KX/src/macros.jl:265 =#
                                tid = KernelAbstractions.__index_Global_Linear(__ctx__, var"##I#274")
                                #= /home/vchuravy/.julia/packages/KernelAbstractions/8W8KX/src/macros.jl:266 =#
                                a += 1
                            end
                        end
                    end
                end
            end
            $(Expr(:popaliasscope))
            return nothing
        end
    end

https://github.com/JuliaGPU/KernelAbstractions.jl/blob/507f1bc173002901ce5e467f8fe46119d627a008/src/macros.jl#L164

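The issue is with how @synchronize works on the CPU: it is a for-loop over the workitems, so the scope of @synchronize determines the scope of @index. In the kernel above, the tid < N check ends up outside that loop, while tid is only assigned inside it, hence the UndefVarError.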
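
To make the scoping problem concrete, here is a minimal plain-Julia sketch that mimics the shape of the generated kernel without using KernelAbstractions at all. The function name lowered_sketch and the fixed work-item range 1:4 are made up for illustration and are not part of the package. The only point is that tid is assigned inside the inner loop but read outside it, which should trigger the same kind of UndefVarError:

```julia
# Hypothetical stand-in for the generated CPU kernel; NOT KernelAbstractions code.
function lowered_sketch(a)
    N = length(a)
    b = 0
    if tid < N                  # `tid` is read here, outside the work-item loop
        for i in 1:10
            for I in 1:4        # stand-in for the work-item loop introduced by @synchronize
                tid = I         # `tid` is local to this loop body only
                b += 1
            end
        end
    end
    return b
end

# Capture the exception instead of letting it abort the script.
err = try
    lowered_sketch(zeros(1024))
    nothing
catch e
    e
end

println(typeof(err))
```

Because no tid exists in the function's outer scope, Julia treats the assignment in the inner loop as a fresh loop-local variable, and the outer read falls through to an undefined global. Moving the tid < N check inside the inner loop, after the assignment, makes the sketch run, which is consistent with the observation that the kernel works without the surrounding if.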

Nov 01 '21 16:11 vchuravy