CUDA.jl
Broadcast uses wrong abstract interpreter, breaks device-only functionality
Cf. https://discourse.julialang.org/t/cuda-atomic-causes-type-instability/72224
Using `CUDA.@atomic` inside of a broadcast errors:
```julia
# ERROR with CUDA.@atomic
using CUDA, LinearAlgebra

function g(test)
    broadcast(CuVector(1:2), transpose(CuVector(1:3))) do i, j
        v::Float32 = j
        CUDA.@atomic test[i] += v
    end
end

g(CUDA.zeros(Float32, 2))
```
```
ERROR: LoadError: GPU broadcast resulted in non-concrete element type Union{}.
This probably means that the function you are broadcasting contains an error or type instability.
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] copy
   @ ~/.julia/packages/GPUArrays/3sW6s/src/host/broadcast.jl:44 [inlined]
 [3] materialize
   @ ./broadcast.jl:883 [inlined]
 [4] broadcast(::var"#45#46"{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, ::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, ::LinearAlgebra.Transpose{Int64, CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}})
   @ Base.Broadcast ./broadcast.jl:821
 [5] g(test::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
```
It's convenient to use `broadcast` and sidestep the need to manually launch a kernel, call the occupancy API, decide how to iterate on the GPU, etc., so it would be nice if this worked in the future. More generally, it would be nice to have a lightweight, multi-dimensional, allocation-free version of `foreach` that can sidestep `@cuda` in this way.
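For reference, here is roughly what the manual path looks like today, and what broadcast would let me avoid: writing a kernel, querying the occupancy API, and picking a launch configuration by hand. This is only a sketch of the example above; the kernel and function names (`atomic_accumulate!`, `g_manual`) are mine, not part of CUDA.jl.

```julia
using CUDA

# one thread per (i, j) pair, linearized over both dimensions
function atomic_accumulate!(test, nj)
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    ni = length(test)
    if idx <= ni * nj
        i = (idx - 1) % ni + 1
        j = (idx - 1) ÷ ni + 1
        CUDA.@atomic test[i] += Float32(j)
    end
    return nothing
end

function g_manual(test, nj)
    n = length(test) * nj
    # compile without launching, then ask the occupancy API for a configuration
    kernel = @cuda launch=false atomic_accumulate!(test, nj)
    config = launch_configuration(kernel.fun)
    threads = min(n, config.threads)
    blocks = cld(n, threads)
    kernel(test, nj; threads, blocks)
    return test
end

g_manual(CUDA.zeros(Float32, 2), 3)
```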
Looked into this, and the cause is surprising: `Broadcast.combine_eltypes`, which relies on `Base._return_type`, uses the native AbstractInterpreter, where `CUDA.@atomic` isn't usable and throws an error, hence the `Union{}` return type. Adding an `interp` argument to all those reflective functions seems cumbersome though; I wonder if we need a `@with_interpreter` that temporarily changes the global interpreter, but then only for reflection? Alternatively, maybe the erroring host versions of device functions should return the same element type as their device counterparts (but then we're back at the annoying `unknowable_true && error` pattern).
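The mechanism can be illustrated on the CPU alone: a function that inference sees as always throwing gets a `Union{}` return type, and `Broadcast.combine_eltypes` propagates that verbatim, which is exactly what GPUArrays then rejects. Here `device_only` is just a stand-in for the erroring host stub of `CUDA.@atomic`:

```julia
# stand-in for a device-only function whose host version always throws
device_only(x) = error("this only works inside a kernel")

Base._return_type(device_only, Tuple{Int})               # Union{}
Base.Broadcast.combine_eltypes(device_only, ([1, 2],))   # Union{}
```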
cc @vchuravy, and maybe also @aviatesk who has already put quite some thought into abstract-interpreter composability.
That is an excellent find! I think the short-term solution is for GPUArrays to use a custom `combine_eltypes` that uses the device interpreter.

I don't think something like `with_interpreter` (besides being the wrong term, since we are not changing the interpreter xD) would work, since we wouldn't want to change the AbsInt for the code in `combine_eltypes` itself, just for the reflection call. So threading the AbsInt through feels better.
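Very roughly, the direction I have in mind is something like the sketch below: a GPUArrays-side eltype hook that backends can override with a device-aware return-type query. None of these names (`gpu_return_type`, `gpu_combine_eltypes`) exist anywhere; the default just falls back to today's behavior, and a backend like CUDA.jl would override the hook to run inference under its device AbstractInterpreter.

```julia
# default hook: same behavior as today (native interpreter via Base._return_type);
# a GPU backend would add a method that threads its own AbstractInterpreter through
gpu_return_type(f, argtypes::Type{<:Tuple}) = Base._return_type(f, argtypes)

# simplified eltype combination; the real combine_eltypes handles nested
# Broadcasted arguments and type-joining, which is elided here
gpu_combine_eltypes(f, args::Tuple) = gpu_return_type(f, Tuple{map(eltype, args)...})

# Where it would slot into GPUArrays' broadcast `copy` (simplified from
# src/host/broadcast.jl, the frame seen in the stacktrace above):
#
#     function Base.copy(bc::Broadcasted{<:AbstractGPUArrayStyle})
#         ElType = gpu_combine_eltypes(bc.f, bc.args)
#         isconcretetype(ElType) ||
#             error("GPU broadcast resulted in non-concrete element type $ElType.")
#         copyto!(similar(bc, ElType), bc)
#     end
```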