@atomic is fragile due to bypassing getindex machinery
Compare
julia> @cuda ((a)->(a[2] = 42; nothing))(CuArray{Int}(undef, 1))
julia> using CUDAdrv
julia> CUDAdrv.synchronize()
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: CUDA error: an illegal instruction was encountered (code #715, ERROR_ILLEGAL_INSTRUCTION)
Stacktrace:
[1] macro expansion at /home/tim/Julia/pkg/CUDAdrv/src/base.jl:145 [inlined]
[2] synchronize(::CuContext) at /home/tim/Julia/pkg/CUDAdrv/src/context.jl:196 (repeats 2 times)
[3] top-level scope at REPL[8]:1
vs
julia> @cuda ((a)->(CUDAnative.@atomic a[2] += 42; nothing))(CuArray{Int}(undef, 1))
julia> CUDAdrv.synchronize()
julia>
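The plain assignment traps because it goes through setindex! and its bounds check, while @atomic writes the element directly and skips that machinery, so the out-of-bounds store above goes unnoticed. Until the macro reuses the normal indexing path, a manual guard works; here is a minimal sketch using current CUDA.jl names (the kernel below is illustrative, not from the report):

using CUDA

# Hedged sketch: perform the same bounds check setindex! would do before the
# atomic update, so an out-of-bounds index traps instead of silently writing.
function guarded_kernel(a)
    i = 2
    checkbounds(a, i)          # throws (traps on device) when i is out of bounds
    CUDA.@atomic a[i] += 42
    return
end

@cuda guarded_kernel(CuArray{Int}(undef, 1))   # traps like the first snippet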
A similar case was reported in https://discourse.julialang.org/t/how-to-parallerize-dual-coordinet-descent-mehods-on-gpu-using-cuda-jl/40344/6:
using CUDA, CUDA.CUSPARSE
using LinearAlgebra, SparseArrays
numsample = 10
numfeature = 5
X⊤ = sprandn(numfeature, numsample, 0.5)
X⊤_colptr = CuVector{Int32}(X⊤.colptr)
X⊤_rowval = CuVector{Int32}(X⊤.rowval)
X⊤_nzval = CuVector{Float32}(X⊤.nzval)
α = CuVector{Float32}(randn(numsample))
d = CuVector{Float32}(X⊤ * α)
function update!(α, d, X⊤_colptr, X⊤_rowval, X⊤_nzval)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    # calc xᵢ⊤d
    xᵢ⊤d = 0.0
    from = X⊤_colptr[i]
    to = X⊤_colptr[i+1] - 1
    for elindex in from:to
        j = X⊤_rowval[elindex]
        Xᵢⱼ = X⊤_nzval[elindex]
        xᵢ⊤d += Xᵢⱼ * d[j]
    end
    # update αᵢ
    Δαᵢ = 0.1 * xᵢ⊤d # dummy
    α[i] += Δαᵢ
    # update d
    for elindex in from:to
        j = X⊤_rowval[elindex]
        Xᵢⱼ = X⊤_nzval[elindex]
        @atomic d[j] += Δαᵢ * Xᵢⱼ
    end
end
@cuda threads=numsample update!(α, d, X⊤_colptr, X⊤_rowval, X⊤_nzval)
This fails (dropping the atomic macro makes it work) because of the Int32 index and the Float32/Float64 value mismatch. Workaround: @atomic d[Int(j)] += Float32(Δαᵢ * Xᵢⱼ)
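For concreteness, a minimal self-contained sketch of the mismatch and the workaround (the kernel and variable names are illustrative; the failure mode is the one described above):

using CUDA

function mismatch_kernel(d, idx)
    j = idx[1]                               # Int32, like the entries of X⊤_rowval
    # CUDA.@atomic d[j] += 0.1               # reportedly fails: Int32 index, Float64 value, Float32 array
    CUDA.@atomic d[Int(j)] += Float32(0.1)   # workaround: widen the index, narrow the value
    return
end

d = CUDA.zeros(Float32, 5)
idx = CuVector{Int32}([3])
@cuda mismatch_kernel(d, idx)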
I think I'm seeing a different symptom of the same issue. It looks like using begin and end doesn't work with CUDA.@atomic at the moment (CUDA.jl v3.8.0):
julia> @macroexpand CUDA.@atomic hist[max(begin, min(bin, end))] += 1
quote
#= /home/tkf/.julia/dev/CUDA/src/device/intrinsics/atomics.jl:438 =#
(CUDA.atomic_arrayset)(hist, (max(begin, min(bin, end)),), +, 1)
end
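In the meantime, a workaround sketch (nothing CUDA-specific about the idea): resolve the index with firstindex/lastindex, which is what begin and end lower to inside indexing, so the macro only sees a plain index. The kernel name and histogram setup below are illustrative:

using CUDA

function hist_kernel(hist, bin)
    i = max(firstindex(hist), min(bin, lastindex(hist)))   # clamp by hand
    CUDA.@atomic hist[i] += 1
    return
end

hist = CUDA.zeros(Int, 10)
@cuda hist_kernel(hist, 42)   # 42 gets clamped to the last bin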
I'm exploring the extension of @atomic in https://github.com/JuliaConcurrent/AtomicArrays.jl, which implements AtomicArrays.@atomic as a superset of Base.@atomic (https://github.com/JuliaConcurrent/AtomicArrays.jl/pull/2). With https://github.com/JuliaConcurrent/AtomicArrays.jl/pull/3 it supports CUDA.jl. The route I took was to add an API AtomicArrays.asref(A::AbstractArray{T}) -> B::AbstractArray{AtomicRef{T}}. This way, we can let the usual indexing mechanism lower the indexing part of @atomic A[max(begin, min(i, end))] += 1.
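To make that lowering concrete, here is a toy CPU-only sketch of the asref idea; the wrapper types, the asref helper, and the non-atomic modify! placeholder are illustrative, not the actual AtomicArrays.jl implementation:

# Indexing the wrapper returns a ref object, so begin/end and arbitrary index
# expressions go through the ordinary getindex lowering; only the final
# read-modify-write needs an atomic primitive.
struct AtomicRef{T,A<:AbstractVector{T}}
    parent::A
    index::Int
end

struct AsRef{T,A<:AbstractVector{T}} <: AbstractVector{AtomicRef{T,A}}
    parent::A
end

asref(A::AbstractVector) = AsRef(A)
Base.size(B::AsRef) = size(B.parent)
Base.IndexStyle(::Type{<:AsRef}) = IndexLinear()
Base.getindex(B::AsRef, i::Int) = AtomicRef(B.parent, i)

# Placeholder for the atomic read-modify-write; a real implementation would use
# atomic intrinsics (or CUDA.atomic_* on device) on a pointer to the element.
modify!(r::AtomicRef, op, x) = (r.parent[r.index] = op(r.parent[r.index], x))

hist = zeros(Int, 10)
bin = 42
B = asref(hist)
modify!(B[max(begin, min(bin, end))], +, 1)   # begin/end resolved by normal indexing

The point being that @atomic A[i] += x only needs to rewrite the update into something like modify!(asref(A)[i], +, x), leaving the index expression itself untouched.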
Sounds great, we can have CUDA.jl depend on AtomicArrays.jl (and re-export @atomic) once it matures a little 🙂
I actually want to shove this into Base (at least as Experimental.AtomicArray or something) once we get enough intrinsics https://github.com/JuliaLang/julia/pull/43809
I just wanted to share AtomicArrays.jl to get more feedback on the interface and the lowering. If there's no apparent problem, I think I'll use this approach for the future PR to Julia. But if you want to backport this to old Julia, I can release AtomicArrays.jl (it's not even released yet) or maybe move it to Compat.jl.
No, not necessarily, it's fine to have this only on the next Julia version.
Compat.jl would be nice, but I agree with Tim: we should add this to Base.