Tullio.jl
Tullio.jl copied to clipboard
Unsafe scatter with GPU arrays
From here: https://discourse.julialang.org/t/increment-elements-of-array-by-index/49694/5
Tullio was detecting that i is unsafe to multi-thread, but wasn't passing this to KernelAbstractions:
using CUDA, Tullio, KernelAbstractions
let A = CUDA.zeros(4), I = cu([1,3,1]), v = cu([1,2,3])
@tullio A[I[i]] += v[i]
end
#+RESULTS:
: 4-element CuArray{Float32,1}:
: 1.0
: 0.0
: 2.0
: 0.0
After my attempt at a fix in https://github.com/mcabbott/Tullio.jl/commit/4d30a1f2169844952a2174113f9b44615c6f3c7b (which runs on the CPU):
julia> let A = CUDA.zeros(4), I = cu([1,3,1]), v = cu([1,2,3])
@tullio A[I[i]] += v[i]
end
ERROR: AssertionError: length(__workgroupsize) <= length(ndrange)
Stacktrace:
[1] partition at /home/user/.julia/packages/KernelAbstractions/jAutM/src/nditeration.jl:103 [inlined]
[2] partition(::KernelAbstractions.Kernel{CUDADevice,KernelAbstractions.NDIteration.StaticSize{(256,)},KernelAbstractions.NDIteration.DynamicSize,var"#gpu_##🇨🇺#253#3"}, ::Tuple{}, ::Nothing) at /home/user/.julia/packages/KernelAbstractions/jAutM/src/KernelAbstractions.jl:385
Besides not running, this is also too cautious, as it would apply to @tullio A[I[i]] = v[i] where it would be more justifiable to over-write -- there is no guaranteed order, but accumulation ought to be right.
@dfdx points out that ScatterNNlib.jl has an approach using atomic_add which may be worth trying to copy:
https://github.com/yuehhua/ScatterNNlib.jl/blob/74622bf40437f16e10468e8a35dae824527c83cf/src/cuda/cuarray.jl#L6-L17