Tullio.jl icon indicating copy to clipboard operation
Tullio.jl copied to clipboard

Unsafe scatter with GPU arrays

Open mcabbott opened this issue 4 years ago • 0 comments

From here: https://discourse.julialang.org/t/increment-elements-of-array-by-index/49694/5

Tullio was detecting that i is unsafe to multi-thread, but wasn't passing this to KernelAbstractions:

using CUDA, Tullio, KernelAbstractions

let A = CUDA.zeros(4), I = cu([1,3,1]), v = cu([1,2,3])
    @tullio A[I[i]] += v[i]
end

#+RESULTS:
: 4-element CuArray{Float32,1}:
:  1.0
:  0.0
:  2.0
:  0.0

After my attempt at a fix in https://github.com/mcabbott/Tullio.jl/commit/4d30a1f2169844952a2174113f9b44615c6f3c7b (which runs on the CPU):

julia> let A = CUDA.zeros(4), I = cu([1,3,1]), v = cu([1,2,3])
           @tullio A[I[i]] += v[i]
       end
ERROR: AssertionError: length(__workgroupsize) <= length(ndrange)
Stacktrace:
 [1] partition at /home/user/.julia/packages/KernelAbstractions/jAutM/src/nditeration.jl:103 [inlined]
 [2] partition(::KernelAbstractions.Kernel{CUDADevice,KernelAbstractions.NDIteration.StaticSize{(256,)},KernelAbstractions.NDIteration.DynamicSize,var"#gpu_##🇨🇺#253#3"}, ::Tuple{}, ::Nothing) at /home/user/.julia/packages/KernelAbstractions/jAutM/src/KernelAbstractions.jl:385

Besides not running, this is also too cautious, as it would apply to @tullio A[I[i]] = v[i] where it would be more justifiable to over-write -- there is no guaranteed order, but accumulation ought to be right.

@dfdx points out that ScatterNNlib.jl has an approach using atomic_add which may be worth trying to copy:

https://github.com/yuehhua/ScatterNNlib.jl/blob/74622bf40437f16e10468e8a35dae824527c83cf/src/cuda/cuarray.jl#L6-L17

mcabbott avatar Nov 12 '20 07:11 mcabbott