CUDA.jl
Broadcast uses wrong abstract interpreter, breaks device-only functionality
Cf. https://discourse.julialang.org/t/cuda-atomic-causes-type-instability/72224
Using `CUDA.@atomic` inside of a broadcast errors:
```julia
# ERROR with CUDA.@atomic
using CUDA, LinearAlgebra

function g(test)
    broadcast(CuVector(1:2), transpose(CuVector(1:3))) do i, j
        v::Float32 = j
        CUDA.@atomic test[i] += v
    end
end

g(CUDA.zeros(Float32, 2))
```
```
ERROR: LoadError: GPU broadcast resulted in non-concrete element type Union{}.
This probably means that the function you are broadcasting contains an error or type instability.
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] copy
   @ ~/.julia/packages/GPUArrays/3sW6s/src/host/broadcast.jl:44 [inlined]
 [3] materialize
   @ ./broadcast.jl:883 [inlined]
 [4] broadcast(::var"#45#46"{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, ::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, ::LinearAlgebra.Transpose{Int64, CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}})
   @ Base.Broadcast ./broadcast.jl:821
 [5] g(test::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
```
It's convenient to use `broadcast` and sidestep the need to manually launch a kernel, call the occupancy API, decide how to iterate on the GPU, etc., so it would be nice if this worked in the future. More generally, it would be nice to have a lightweight, multi-dimensional, allocation-free version of `foreach` that can sidestep `@cuda` in this way.
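For reference, here is roughly what the manual path looks like today, and what broadcast would let me avoid: writing a kernel, querying the occupancy API, and picking a launch configuration by hand. This is only a sketch of the example above; the kernel and function names (`atomic_accumulate!`, `g_manual`) are mine, not part of CUDA.jl.

```julia
using CUDA

# one thread per (i, j) pair, linearized over both dimensions
function atomic_accumulate!(test, nj)
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    ni = length(test)
    if idx <= ni * nj
        i = (idx - 1) % ni + 1
        j = (idx - 1) ÷ ni + 1
        CUDA.@atomic test[i] += Float32(j)
    end
    return nothing
end

function g_manual(test, nj)
    n = length(test) * nj
    # compile without launching, then ask the occupancy API for a configuration
    kernel = @cuda launch=false atomic_accumulate!(test, nj)
    config = launch_configuration(kernel.fun)
    threads = min(n, config.threads)
    blocks = cld(n, threads)
    kernel(test, nj; threads, blocks)
    return test
end

g_manual(CUDA.zeros(Float32, 2), 3)
```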
Looked into this, and the cause is surprising: `Broadcast.combine_eltypes`, which relies on `Base._return_type`, uses the native AbstractInterpreter, where `CUDA.@atomic` isn't usable and throws an error, hence the `Union{}` return type. Adding an `interp` argument to all those reflective functions seems cumbersome though; I wonder if we need a `@with_interpreter` that temporarily changes the global interpreter, but then only for reflection? Alternatively, maybe the erroring host versions of device functions should return the same element type as their device counterparts (but then we're back at the annoying `unknowable_true && error` pattern).
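The mechanism can be illustrated on the CPU alone: a function that inference sees as always throwing gets a `Union{}` return type, and `Broadcast.combine_eltypes` propagates that verbatim, which is exactly what GPUArrays then rejects. Here `device_only` is just a stand-in for the erroring host stub of `CUDA.@atomic`:

```julia
# stand-in for a device-only function whose host version always throws
device_only(x) = error("this only works inside a kernel")

Base._return_type(device_only, Tuple{Int})               # Union{}
Base.Broadcast.combine_eltypes(device_only, ([1, 2],))   # Union{}
```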
cc @vchuravy, and maybe also @aviatesk who has already put quite some thought into abstract-interpreter composability.
That is an excellent find! I think the short-term solution is for GPUArrays to use a custom `combine_eltypes` that uses the device interpreter.

I don't think something like `with_interpreter` (besides being the wrong term, since we are not changing the interpreter xD) would work, since we wouldn't want to change the AbsInt for the code in `combine_eltypes` itself, just for the reflection call. So threading the AbsInt through feels better.
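Very roughly, the direction I have in mind is something like the sketch below: a GPUArrays-side eltype hook that backends can override with a device-aware return-type query. None of these names (`gpu_return_type`, `gpu_combine_eltypes`) exist anywhere; the default just falls back to today's behavior, and a backend like CUDA.jl would override the hook to run inference under its device AbstractInterpreter.

```julia
# default hook: same behavior as today (native interpreter via Base._return_type);
# a GPU backend would add a method that threads its own AbstractInterpreter through
gpu_return_type(f, argtypes::Type{<:Tuple}) = Base._return_type(f, argtypes)

# simplified eltype combination; the real combine_eltypes handles nested
# Broadcasted arguments and type-joining, which is elided here
gpu_combine_eltypes(f, args::Tuple) = gpu_return_type(f, Tuple{map(eltype, args)...})

# Where it would slot into GPUArrays' broadcast `copy` (simplified from
# src/host/broadcast.jl, the frame seen in the stacktrace above):
#
#     function Base.copy(bc::Broadcasted{<:AbstractGPUArrayStyle})
#         ElType = gpu_combine_eltypes(bc.f, bc.args)
#         isconcretetype(ElType) ||
#             error("GPU broadcast resulted in non-concrete element type $ElType.")
#         copyto!(similar(bc, ElType), bc)
#     end
```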