Synchronization in a nested while loop causes loss of threadID
I ran across a method that needs a shared memory pool of size blocksize and reads from it several times inside a for/while loop. I found that the CPU backend has problems with this. Here is an MWE (without the shmem shenanigans):
using Test
using CUDA
using CUDAKernels
using KernelAbstractions

@kernel function f_test_kernel!(a)
    tid = @index(Global, Linear)
    @uniform N = length(a)
    @uniform b = 0
    if tid < N
        for i = 1:10
            b += 1
            @synchronize()
        end
    end
end

a = zeros(1024)
# works
wait(f_test_kernel!(CUDADevice(),256)(CuArray(a), ndrange=1024))
# doesn't work
wait(f_test_kernel!(CPU(),4)(a, ndrange=1024))
Note: without the if statement, everything works fine. I also tried a few different nested if statements to see if a similar error occurred, but could not replicate it. It seems to be specifically a loop after a conditional (although maybe a loop in a loop would also trigger it? Still digging).
Error message (tid not defined):
ERROR: LoadError: TaskFailedException
Stacktrace:
[1] wait
@ ./task.jl:322 [inlined]
[2] wait
@ ~/projects/KernelAbstractions.jl/src/cpu.jl:65 [inlined]
[3] wait (repeats 2 times)
@ ~/projects/KernelAbstractions.jl/src/cpu.jl:29 [inlined]
[4] top-level scope
@ ~/projects/simuleios/histograms/mwe4.jl:22
[5] include(fname::String)
@ Base.MainInclude ./client.jl:444
[6] top-level scope
@ REPL[7]:1
[7] top-level scope
@ ~/.julia/packages/CUDA/YpW0k/src/initialization.jl:52
nested task error: UndefVarError: tid not defined
Stacktrace:
[1] cpu_f_test_kernel!(::KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.NoDynamicCheck, CartesianIndex{1}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(4,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, ::Vector{Float64})
@ ./none:0 [inlined]
[2] overdub
@ ./none:0 [inlined]
[3] __thread_run(tid::Int64, len::Int64, rem::Int64, obj::KernelAbstractions.Kernel{CPU, KernelAbstractions.NDIteration.StaticSize{(4,)}, KernelAbstractions.NDIteration.DynamicSize, typeof(cpu_f_test_kernel!)}, ndrange::Tuple{Int64}, iterspace::KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(4,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}, args::Tuple{Vector{Float64}}, dynamic::KernelAbstractions.NDIteration.NoDynamicCheck)
@ KernelAbstractions ~/projects/KernelAbstractions.jl/src/cpu.jl:157
[4] __run(obj::KernelAbstractions.Kernel{CPU, KernelAbstractions.NDIteration.StaticSize{(4,)}, KernelAbstractions.NDIteration.DynamicSize, typeof(cpu_f_test_kernel!)}, ndrange::Tuple{Int64}, iterspace::KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(4,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}, args::Tuple{Vector{Float64}}, dynamic::KernelAbstractions.NDIteration.NoDynamicCheck)
@ KernelAbstractions ~/projects/KernelAbstractions.jl/src/cpu.jl:130
[5] (::KernelAbstractions.var"#33#34"{Nothing, Nothing, typeof(KernelAbstractions.__run), Tuple{KernelAbstractions.Kernel{CPU, KernelAbstractions.NDIteration.StaticSize{(4,)}, KernelAbstractions.NDIteration.DynamicSize, typeof(cpu_f_test_kernel!)}, Tuple{Int64}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(4,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}, Tuple{Vector{Float64}}, KernelAbstractions.NDIteration.NoDynamicCheck}})()
@ KernelAbstractions ~/projects/KernelAbstractions.jl/src/cpu.jl:22
in expression starting at /home/leios/projects/simuleios/histograms/mwe4.jl:22
I'll try my hand at a fix if I cannot find a workaround, but I figured I would open an issue here first.
The generated CPU kernel is:
function cpu_f_test_kernel!(__ctx__, a; )
    let
        $(Expr(:aliasscope))
        begin
            N = length(a)
            b = 0
            var"##N#273" = length((KernelAbstractions.__workitems_iterspace)(__ctx__))
        end
        if tid < N
            for i = 1:10
                begin
                    var"##N#275" = length((KernelAbstractions.__workitems_iterspace)(__ctx__))
                    begin
                        #= /home/vchuravy/.julia/packages/KernelAbstractions/8W8KX/src/macros.jl:263 =#
                        for var"##I#274" = (KernelAbstractions.__workitems_iterspace)(__ctx__)
                            #= /home/vchuravy/.julia/packages/KernelAbstractions/8W8KX/src/macros.jl:264 =#
                            (KernelAbstractions.__validindex)(__ctx__, var"##I#274") || continue
                            #= /home/vchuravy/.julia/packages/KernelAbstractions/8W8KX/src/macros.jl:265 =#
                            tid = KernelAbstractions.__index_Global_Linear(__ctx__, var"##I#274")
                            #= /home/vchuravy/.julia/packages/KernelAbstractions/8W8KX/src/macros.jl:266 =#
                            b += 1
                        end
                    end
                end
            end
        end
        $(Expr(:popaliasscope))
        return nothing
    end
end
https://github.com/JuliaGPU/KernelAbstractions.jl/blob/507f1bc173002901ce5e467f8fe46119d627a008/src/macros.jl#L164
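The issue is with how @synchronize works: on the CPU it is lowered to a for loop over the work items, and thus the scope of @synchronize determines the scope of @index. In the kernel above, the tid assignment ends up inside that work-item loop, while the if tid < N check is evaluated outside it, before tid has been defined.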
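The scoping problem can be reproduced in plain Julia, without KernelAbstractions. The sketch below is only an illustration: guard_outside_loop and guard_inside_loop are hypothetical stand-ins for the shape of the lowered CPU kernel, not code that the package generates. The first reads tid before the work-item loop assigns it; the second moves the check inside the loop, where tid exists.

```julia
# Hypothetical stand-in for the lowered kernel: `tid` is assigned only inside
# the work-item loop, but the guard that reads it sits outside that loop.
function guard_outside_loop(workitems, N)
    b = 0
    if tid < N                  # `tid` is not a local here, so Julia looks up a global
        for i in 1:10
            for I in workitems  # plays the role of the loop that @synchronize becomes
                tid = I         # local to this loop body only
                b += 1
            end
        end
    end
    return b
end

# Same structure with the guard moved inside the work-item loop.
function guard_inside_loop(workitems, N)
    b = 0
    for i in 1:10
        for I in workitems
            tid = I
            if tid < N          # `tid` is defined at this point
                b += 1
            end
        end
    end
    return b
end

# The first variant fails the same way as the kernel on the CPU backend.
try
    guard_outside_loop(1:4, 4)
catch err
    println(typeof(err))        # UndefVarError
end

println(guard_inside_loop(1:4, 4))  # 30: three work items pass the guard, ten times
```

Whether the real kernel can be restructured this way depends on what the synchronization is protecting, so treat this as a picture of the scoping only, not as a drop-in workaround.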