KernelAbstractions.jl
Variable scoping issue leading to unexpected UndefVarError on CPU
I have encountered what I think is a variable scoping issue that causes one of my KernelAbstractions kernels to fail when executing on the CPU. (GPU execution is fine.) I'm using KernelAbstractions v0.9.6 in Julia 1.9.2. Here's a minimal example that triggers the problem:
using KernelAbstractions

@kernel function mykernel(x)
    i = @index(Global, Linear)
    _, Nblocks = @ndrange()
    @inbounds begin
        id = Nblocks
        @synchronize
        x[i] = 1.0
    end
end
x = ones(256, 1)
backend = get_backend(x)
kernel! = mykernel(backend, (256,))
kernel!(x, ndrange = (256, 1))
When I run this code, it fails with:
ERROR: LoadError: UndefVarError: `Nblocks` not defined
Stacktrace:
[1] cpu_mykernel
@ ~/.julia/packages/KernelAbstractions/lhhMo/src/macros.jl:276 [inlined]
[2] cpu_mykernel(__ctx__::KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.NoDynamicCheck, CartesianIndex{2}, CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{2, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(256, 1)}, CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}, Nothing}}, x::Matrix{Float64})
@ Main ./none:0
[3] __thread_run(tid::Int64, len::Int64, rem::Int64, obj::KernelAbstractions.Kernel{CPU, KernelAbstractions.NDIteration.StaticSize{(256,)}, KernelAbstractions.NDIteration.DynamicSize, typeof(cpu_mykernel)}, ndrange::Tuple{Int64, Int64}, iterspace::KernelAbstractions.NDIteration.NDRange{2, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(256, 1)}, CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}, Nothing}, args::Tuple{Matrix{Float64}}, dynamic::KernelAbstractions.NDIteration.NoDynamicCheck)
@ KernelAbstractions ~/.julia/packages/KernelAbstractions/lhhMo/src/cpu.jl:115
[4] __run(obj::KernelAbstractions.Kernel{CPU, KernelAbstractions.NDIteration.StaticSize{(256,)}, KernelAbstractions.NDIteration.DynamicSize, typeof(cpu_mykernel)}, ndrange::Tuple{Int64, Int64}, iterspace::KernelAbstractions.NDIteration.NDRange{2, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(256, 1)}, CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}, Nothing}, args::Tuple{Matrix{Float64}}, dynamic::KernelAbstractions.NDIteration.NoDynamicCheck, static_threads::Bool)
@ KernelAbstractions ~/.julia/packages/KernelAbstractions/lhhMo/src/cpu.jl:82
[5] (::KernelAbstractions.Kernel{CPU, KernelAbstractions.NDIteration.StaticSize{(256,)}, KernelAbstractions.NDIteration.DynamicSize, typeof(cpu_mykernel)})(args::Matrix{Float64}; ndrange::Tuple{Int64, Int64}, workgroupsize::Nothing)
@ KernelAbstractions ~/.julia/packages/KernelAbstractions/lhhMo/src/cpu.jl:44
[6] top-level scope
@ ~/debug.jl:19
[7] include(fname::String)
@ Base.MainInclude ./client.jl:478
[8] top-level scope
@ REPL[1]:1
The compiler thinks that the variable Nblocks in the id = Nblocks line is not defined, even though it clearly is defined via the call to @ndrange. When I inspect the generated kernel code with code_lowered(), I see:
julia> code_lowered(kernel!.f)
1-element Vector{Core.CodeInfo}:
CodeInfo(
[...]
5 ── i@_14 = KernelAbstractions.__index_Global_Linear(__ctx__, I#301)
│ %25 = (KernelAbstractions.ndrange)(__ctx__)
│ %26 = (size)(%25)
│ %27 = Base.indexed_iterate(%26, 1)
│ Core.getfield(%27, 1)
│ @_11 = Core.getfield(%27, 2)
│ %30 = Base.indexed_iterate(%26, 2, @_11)
└─── Nblocks = Core.getfield(%30, 1)
[...]
12 ─ i@_17 = KernelAbstractions.__index_Global_Linear(__ctx__, I#303)
└─── id = Main.Nblocks
[...]
)
The code in block 5 shows that Nblocks is getting set OK, but the code in block 12 shows that when the id = Nblocks line gets translated, the compiler looks for a definition of Nblocks in the Main module, where it does not exist. (I redacted this listing for readability. I'm happy to provide the full listing if that would be helpful.)
The issue disappears if I remove the call to @synchronize.
Any thoughts here?
EDIT: This is probably related to (maybe even a duplicate of) #274. Also, another way I can get the issue to disappear is to move the call to @ndrange that defines Nblocks inside the @inbounds begin ... end block.
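That is, a variant along these lines runs without the error:

@kernel function mykernel(x)
    i = @index(Global, Linear)
    @inbounds begin
        # Defining Nblocks inside the @inbounds block (rather than
        # before it) also makes the UndefVarError go away:
        _, Nblocks = @ndrange()
        id = Nblocks
        @synchronize
        x[i] = 1.0
    end
end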
Yeah, this is expected, and it is the reason the @uniform macro is needed:
https://juliagpu.github.io/KernelAbstractions.jl/api/#KernelAbstractions.@uniform
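For the kernel above, that means hoisting the ndrange query out of the per-workitem scope, something like:

@kernel function mykernel(x)
    i = @index(Global, Linear)
    # @uniform evaluates this once per workgroup, so the value is
    # still available after the split introduced by @synchronize:
    @uniform Nblocks = @ndrange()[2]
    @inbounds begin
        id = Nblocks
        @synchronize
        x[i] = 1.0
    end
end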
OK, yeah: using @uniform fixes it. I guess what confuses me here is that I didn't need to do that for the variables declared with @index. (Indeed, putting indices inside a @uniform block triggers a different error.) But I can work with that. Thanks!
Yeah, the CPU lowering is a bit tricky and doesn't have the best errors.
Hey, I need help here please!
No matter whether I use @uniform or @private on the index, or leave them out, I can't run the code. Here are the variants I tried and the errors they produce.

This:

index = @index(Global)
@uniform tid = index - 1

fails with:

LoadError: UndefVarError: index not defined

This:

index = @uniform @index(Global)
@uniform tid = index - 1

fails with:

LoadError: MethodError: no method matching __index_Global_Linear(::KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.NoDynamicCheck, CartesianIndex{2}, CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{2, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}, CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}}})

And this:

@uniform index = @index(Global)
@uniform tid = index - 1

fails with the same MethodError.
Any ideas? Thanks!
@ManuelCostanzo please make it easier to help you by formatting your post.
I am unsure what you want to achieve. By definition, @uniform and @index are incompatible.
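@uniform hoists a computation out of the per-workitem scope, so it only makes sense for values that are the same for every workitem. Roughly:

# Fine: the ndrange is the same for every workitem, so a value
# derived from it can be hoisted with @uniform:
@uniform Nblocks = @ndrange()[2]

# Not meaningful: @index(Global) differs for every workitem, so there
# is no single uniform value to hoist (this combination produces the
# errors shown above):
@uniform index = @index(Global)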
Hi @vchuravy, I just need to access the thread and block index values after @synchronize. On the CPU, the only way to do that (as far as I know) is to declare the variables with @uniform or @private, but I can't get that to work: with @uniform I get an "id not defined" error, and without it I get "varr not defined" instead. Here is an example:
using KernelAbstractions, CUDA
const backend = CPU()
const BLOCK_SIZE = 4
@kernel function kk(A, B, C)
    id = @index(Global)  # I tried using @uniform and @private
    @uniform varr = id   # I tried using @private too
    for i in 1:varr
        for j in 1:varr
            @synchronize()
            C[i, j] = 0
            for k in 1:varr
                C[i, j] += A[i, k] * B[k, j]
            end
        end
    end
end
function run_gpu()
    m = 10
    n = 20
    # Initialize the matrices on the GPU
    A = KernelAbstractions.zeros(backend, Int, m, n)
    B = KernelAbstractions.zeros(backend, Int, m, n)
    C = KernelAbstractions.zeros(backend, Int, m, n)
    # Compute the block size
    block_size = BLOCK_SIZE
    mn = max(m, n)
    if mn < BLOCK_SIZE
        block_size = mn
    end
    # Compute the number of blocks
    total_blocks = (mn + block_size - 1) ÷ block_size
    # Anti-diagonal loop
    @time @inbounds for diag in 0:(2*total_blocks-1)
        # Number of blocks to launch on this anti-diagonal
        num_blocks_diagonal = min(diag + 1, 2 * total_blocks - diag - 1)
        kernel! = kk(backend)
        kernel!(A, B, C, ndrange = (block_size * block_size, num_blocks_diagonal), workgroupsize = block_size * block_size)
        KernelAbstractions.synchronize(backend)
    end
end
run_gpu()