@atomic is fragile due to bypassing getindex machinery
Compare
julia> @cuda ((a)->(a[2] = 42; nothing))(CuArray{Int}(undef, 1))
julia> using CUDAdrv
julia> CUDAdrv.synchronize()
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: CUDA error: an illegal instruction was encountered (code #715, ERROR_ILLEGAL_INSTRUCTION)
Stacktrace:
[1] macro expansion at /home/tim/Julia/pkg/CUDAdrv/src/base.jl:145 [inlined]
[2] synchronize(::CuContext) at /home/tim/Julia/pkg/CUDAdrv/src/context.jl:196 (repeats 2 times)
[3] top-level scope at REPL[8]:1
vs
julia> @cuda ((a)->(CUDAnative.@atomic a[2] += 42; nothing))(CuArray{Int}(undef, 1))
julia> CUDAdrv.synchronize()
julia>
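The plain assignment traps because it goes through setindex! and its bounds check, while @atomic writes the element directly and skips that machinery, so the out-of-bounds store above goes unnoticed. Until the macro reuses the normal indexing path, a manual guard works; here is a minimal sketch using current CUDA.jl names (the kernel below is illustrative, not from the report):

using CUDA

# Hedged sketch: perform the same bounds check setindex! would do before the
# atomic update, so an out-of-bounds index traps instead of silently writing.
function guarded_kernel(a)
    i = 2
    checkbounds(a, i)          # throws (traps on device) when i is out of bounds
    CUDA.@atomic a[i] += 42
    return
end

@cuda guarded_kernel(CuArray{Int}(undef, 1))   # traps like the first snippet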
A similar case was reported in https://discourse.julialang.org/t/how-to-parallerize-dual-coordinet-descent-mehods-on-gpu-using-cuda-jl/40344/6:
using CUDA, CUDA.CUSPARSE
using LinearAlgebra, SparseArrays
numsample = 10
numfeature = 5
X⊤ = sprandn(numfeature, numsample, 0.5)
X⊤_colptr = CuVector{Int32}(X⊤.colptr)
X⊤_rowval = CuVector{Int32}(X⊤.rowval)
X⊤_nzval = CuVector{Float32}(X⊤.nzval)
α = CuVector{Float32}(randn(numsample))
d = CuVector{Float32}(X⊤ * α)
function update!(α, d, X⊤_colptr, X⊤_rowval, X⊤_nzval)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    # calc xᵢ⊤d
    xᵢ⊤d = 0.0
    from = X⊤_colptr[i]
    to = X⊤_colptr[i+1] - 1
    for elindex in from:to
        j = X⊤_rowval[elindex]
        Xᵢⱼ = X⊤_nzval[elindex]
        xᵢ⊤d += Xᵢⱼ * d[j]
    end
    # update αᵢ
    Δαᵢ = 0.1 * xᵢ⊤d # dummy
    α[i] += Δαᵢ
    # update d
    for elindex in from:to
        j = X⊤_rowval[elindex]
        Xᵢⱼ = X⊤_nzval[elindex]
        @atomic d[j] += Δαᵢ * Xᵢⱼ
    end
end
@cuda threads=numsample update!(α, d, X⊤_colptr, X⊤_rowval, X⊤_nzval)
This fails (dropping the atomic macro makes it work) because of the Int32 index and the Float32/Float64 value mismatch. Workaround: @atomic d[Int(j)] += Float32(Δαᵢ * Xᵢⱼ)
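For concreteness, a minimal self-contained sketch of the mismatch and the workaround (the kernel and variable names are illustrative; the failure mode is the one described above):

using CUDA

function mismatch_kernel(d, idx)
    j = idx[1]                               # Int32, like the entries of X⊤_rowval
    # CUDA.@atomic d[j] += 0.1               # reportedly fails: Int32 index, Float64 value, Float32 array
    CUDA.@atomic d[Int(j)] += Float32(0.1)   # workaround: widen the index, narrow the value
    return
end

d = CUDA.zeros(Float32, 5)
idx = CuVector{Int32}([3])
@cuda mismatch_kernel(d, idx)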
I think I'm seeing a different symptom of the same issue. It looks like using begin and end doesn't work with CUDA.@atomic at the moment (CUDA.jl v3.8.0):
julia> @macroexpand CUDA.@atomic hist[max(begin, min(bin, end))] += 1
quote
#= /home/tkf/.julia/dev/CUDA/src/device/intrinsics/atomics.jl:438 =#
(CUDA.atomic_arrayset)(hist, (max(begin, min(bin, end)),), +, 1)
end
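In the meantime, a workaround sketch (nothing CUDA-specific about the idea): resolve the index with firstindex/lastindex, which is what begin and end lower to inside indexing, so the macro only sees a plain index. The kernel name and histogram setup below are illustrative:

using CUDA

function hist_kernel(hist, bin)
    i = max(firstindex(hist), min(bin, lastindex(hist)))   # clamp by hand
    CUDA.@atomic hist[i] += 1
    return
end

hist = CUDA.zeros(Int, 10)
@cuda hist_kernel(hist, 42)   # 42 gets clamped to the last bin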
I'm exploring the extension of @atomic in https://github.com/JuliaConcurrent/AtomicArrays.jl, which implements AtomicArrays.@atomic as a superset of Base.@atomic (https://github.com/JuliaConcurrent/AtomicArrays.jl/pull/2). With https://github.com/JuliaConcurrent/AtomicArrays.jl/pull/3 it supports CUDA.jl. The route I took was to add an API AtomicArrays.asref(A::AbstractArray{T}) -> B::AbstractArray{AtomicRef{T}}. This way, we can let the usual indexing mechanism lower the indexing part of @atomic A[max(begin, min(i, end))] += 1.
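To make that lowering concrete, here is a toy CPU-only sketch of the asref idea; the wrapper types, the asref helper, and the non-atomic modify! placeholder are illustrative, not the actual AtomicArrays.jl implementation:

# Indexing the wrapper returns a ref object, so begin/end and arbitrary index
# expressions go through the ordinary getindex lowering; only the final
# read-modify-write needs an atomic primitive.
struct AtomicRef{T,A<:AbstractVector{T}}
    parent::A
    index::Int
end

struct AsRef{T,A<:AbstractVector{T}} <: AbstractVector{AtomicRef{T,A}}
    parent::A
end

asref(A::AbstractVector) = AsRef(A)
Base.size(B::AsRef) = size(B.parent)
Base.IndexStyle(::Type{<:AsRef}) = IndexLinear()
Base.getindex(B::AsRef, i::Int) = AtomicRef(B.parent, i)

# Placeholder for the atomic read-modify-write; a real implementation would use
# atomic intrinsics (or CUDA.atomic_* on device) on a pointer to the element.
modify!(r::AtomicRef, op, x) = (r.parent[r.index] = op(r.parent[r.index], x))

hist = zeros(Int, 10)
bin = 42
B = asref(hist)
modify!(B[max(begin, min(bin, end))], +, 1)   # begin/end resolved by normal indexing

The point being that @atomic A[i] += x only needs to rewrite the update into something like modify!(asref(A)[i], +, x), leaving the index expression itself untouched.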
Sounds great, we can have CUDA.jl depend on AtomicArrays.jl (and re-export @atomic) once it matures a little 🙂
I actually want to shove this into Base (at least as Experimental.AtomicArray or something) once we get enough intrinsics https://github.com/JuliaLang/julia/pull/43809
I just wanted to share AtomicArrays.jl to get more feedback on the interface and the lowering. If there's no apparent problem, I think I'll use this approach for the future PR to Julia. But if you want to backport this to old Julia, I can release AtomicArrays.jl (it's not even released yet) or maybe move it to Compat.jl.
No, not necessarily, it's fine to have this only on the next Julia version.
Compat.jl would be nice, but I agree with Tim: we should add this to Base.