Issues with offset indices and GPU
Trying to implement a slightly weird "half-convolution" as follows:
using NNlib, Tullio
_tullio_conv(u, v) = @tullio y[i+_] := u[i+a] * v[a]
function tullio_conv(u::AbstractVector, v::AbstractVector)
    # Left pad
    upad = NNlib.pad_zeros(u, (length(v) - 1, 0))
    v_rev = @view v[end:-1:1]
    return _tullio_conv(upad, v_rev)
end
which results in "KernelAbstractions can't handle OffsetArrays here" when running tullio_conv(a, b) with a and b being CuArrays of the same size. It works fine on the CPU, though.
Not sure if this is Tullio.jl's or KernelAbstractions.jl's "fault".
Perhaps both? When last I checked, KernelAbstractions.jl needed indices starting at 1 here, and Tullio.jl applies the shift in y[i+_] := on the LHS, while otherwise letting i do what it likes. The kernel looks like this:
┌ Info: =====KA===== KernelAbstractions kernel
│   verbosetidy(kex1) =
│    quote
│        KernelAbstractions.@kernel function var"##🇨🇺#263"(ℛ::AbstractArray{𝒯}, u, v, 𝛥i, 𝒶𝓍i, 𝒶𝓍a, ♻️, 💀) where 𝒯
│            (i,) = @index(Global, NTuple)
│            begin
│                𝒜𝒸𝒸 = if ♻️ === nothing
│                        zero(𝒯)
│                    else
│                        ℛ[i + 𝛥i]
│                    end
│                for a = 𝒶𝓍a
│                    𝒜𝒸𝒸 = 𝒜𝒸𝒸 + u[(i + a) - 1] * v[a]
│                end
│                ℛ[i + 𝛥i] = 𝒜𝒸𝒸
│            end
│        end
│    end
└ end
If it were smarter, it could move the 𝛥i to the RHS and adjust the range of i. You can also do this yourself, by writing @tullio y[i+_] := u[i+a-1] * v[a] verbose=true. (This prints out the index ranges; verbose=2 also prints the generated kernels, etc.)
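Applied to the original example, that manual shift might look something like this. A rough sketch which I haven't run on the GPU; the tullio_conv_shifted names are just for illustration:

# With i+a-1 on the RHS, i runs over 1:length(u), so the LHS shift _ is zero
_tullio_conv_shifted(u, v) = @tullio y[i+_] := u[i+a-1] * v[a]

function tullio_conv_shifted(u::AbstractVector, v::AbstractVector)
    upad = NNlib.pad_zeros(u, (length(v) - 1, 0))  # left pad, as before
    v_rev = @view v[end:-1:1]
    return _tullio_conv_shifted(upad, v_rev)
end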
There is some built-in handling for padding, BTW, although I'm not sure whether it works on the GPU. Reversing v should also just be a sign change in the indices, I think:
julia> u = 1:5; v = [-1,0,10];
julia> _tullio_conv(u,v)
3-element Vector{Int64}:
29
38
47
julia> tullio_conv(u,v)
5-element Vector{Int64}:
-1
-2
7
16
25
julia> @tullio y[i] := u[pad(i-a+1,2,0)] * v[a]
5-element Vector{Int64}:
-1
-2
7
16
25
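If you want that pad form as a function, note the pad widths are written as literals here, hard-coded to length(v) - 1 == 2 for this 3-element kernel. A sketch only (tullio_conv_pad3 is a made-up name, and again I haven't checked the GPU path):

# The minus sign on a replaces reversing v; pad(..., 2, 0) replaces the left zero-padding
tullio_conv_pad3(u, v) = @tullio y[i] := u[pad(i-a+1, 2, 0)] * v[a]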
Note also that if you need gradients, they may be wrong until #86 fixes a bug, and (sadly) may not work on the GPU after that.
Hi, is this the same issue this code has?
using Tullio
using CUDA, CUDAKernels, KernelAbstractions
N = cu(rand(Float32, 10,10))
K = cu(rand(Float32, 3,3))
@tullio M[x+_,y+_] := N[x+i, y+j] * K[i,j]
gives ERROR: LoadError: "KernelAbstractions can't handle OffsetArrays here"
Is there a way to work around this?
Yes, it's the same. The manual fix is to change the indices to have ranges starting from 1:
julia> @tullio M[x+_,y+_] := N[x+i, y+j] * K[i,j] verbose=true
┌ Info: left index ranges
│ x = 0:7
└ y = 0:7
ERROR: "KernelAbstractions can't handle OffsetArrays here"
julia> @tullio M[x,y] := N[x+i-1, y+j-1] * K[i,j] verbose=true
┌ Info: left index ranges
│ x = 1:8
└ y = 1:8
The generated code is pasted below. (x, y) = @index(Global, NTuple) is the "output" of ndrange = tuple(length(𝒶𝓍x), length(𝒶𝓍y)), and when last I checked it was a limitation of KernelAbstractions that these indices always start from 1.
As above, Tullio could be made to work around this, but I don't think it would be trivial. (In general it tries to treat OffsetArrays as first-class citizens, and allowing x to have any range it likes falls out of that viewpoint.)
┌ Info: =====KA===== KernelAbstractions kernel
│   verbosetidy(kex1) =
│    quote
│        KernelAbstractions.@kernel function var"##🇨🇺#278"(ℛ::AbstractArray{𝒯}, N, K, 𝛥x, 𝛥y, 𝒶𝓍x, 𝒶𝓍y, 𝒶𝓍i, 𝒶𝓍j, ♻️, 💀) where 𝒯
│            (x, y) = @index(Global, NTuple)
│            begin
│                𝒜𝒸𝒸 = if ♻️ === nothing
│                        zero(𝒯)
│                    else
│                        ℛ[x + 𝛥x, y + 𝛥y]
│                    end
│                for j = 𝒶𝓍j
│                    for i = 𝒶𝓍i
│                        𝒜𝒸𝒸 = 𝒜𝒸𝒸 + N[x + i, y + j] * K[i, j]
│                    end
│                end
│                ℛ[x + 𝛥x, y + 𝛥y] = 𝒜𝒸𝒸
│            end
│        end
│    end
└ end
┌ Info: =====KA===== KernelAbstractions CUDA actor
│   verbosetidy(kex2) =
│    quote
│        local @inline(function 𝒜𝒸𝓉!(::Type{<:CuArray}, ℛ::AbstractArray{𝒯}, N, K, 𝛥x, 𝛥y, 𝒶𝓍x, 𝒶𝓍y, 𝒶𝓍i, 𝒶𝓍j, ♻️ = nothing, 💀 = true) where 𝒯
│                @info "running KernelAbstractions + CUDA actor " maxlog = 3 _id = 0x590ade10bed79747
│                cu_kern! = var"##🇨🇺#278"(CUDADevice())
│                (first)(𝒶𝓍x) == 1 || (throw)("KernelAbstractions can't handle OffsetArrays here")
│                (first)(𝒶𝓍y) == 1 || (throw)("KernelAbstractions can't handle OffsetArrays here")
│                𝒜𝒸𝒸 = cu_kern!(ℛ::AbstractArray{𝒯}, N, K, 𝛥x, 𝛥y, 𝒶𝓍x, 𝒶𝓍y, 𝒶𝓍i, 𝒶𝓍j, ♻️, 💀; ndrange = tuple(length(𝒶𝓍x), length(𝒶𝓍y)), workgroupsize = nothing, dependencies = Event(CUDADevice()))
│                KernelAbstractions.wait(CUDADevice(), 𝒜𝒸𝒸)
│            end)
└ end
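Putting that together with your original setup, the working GPU version would look something like this (same arrays as in your snippet):

using Tullio
using CUDA, CUDAKernels, KernelAbstractions
N = cu(rand(Float32, 10, 10))
K = cu(rand(Float32, 3, 3))
# x and y now start at 1, so no OffsetArray shift is needed
@tullio M[x,y] := N[x+i-1, y+j-1] * K[i,j]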
I understand, it works with that. However, it now warns that "using KernelAbstractions with no outer indices, this will be slow". Is there a way to improve that?
This warning should be clearer that it's only about the gradient, but it seems like a tricky problem. If different parts of the range of x are done in parallel, then writing into 𝛥N[(x + i) - 1, (y + j) - 1] is unsafe, since threads working on nearby values of x can have values of i for which their writes collide.
The ultimate solution would presumably be to solve for x in terms of x_plus_i, and re-write the expression as a loop over the x_plus_i values appearing on the left (in the gradient), which would then be safe to divide among threads. But the range of i then stops being simple for pixels near the edges; it's like dealing with padding. (A rough sketch of this re-indexing follows the output below.)
julia> @tullio M[x,y] := N[x+i-1, y+j-1] * K[i,j] verbose=true
┌ Info: symbolic gradients
│ inbody =
│ 2-element Vector{Any}:
│ :(𝛥N[(x + i) - 1, (y + j) - 1] = 𝛥N[(x + i) - 1, (y + j) - 1] + 𝛥ℛ[x, y] * conj(K[i, j]))
└ :(𝛥K[i, j] = 𝛥K[i, j] + 𝛥ℛ[x, y] * conj(N[(x + i) - 1, (y + j) - 1]))
┌ Warning: using KernelAbstractions with no outer indices, this will be slow
└ @ Tullio ~/.julia/dev/Tullio/src/macro.jl:1135
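To make that re-indexing concrete, here is a rough hand-written sketch in plain Julia (not what Tullio generates; grad_N_sketch is a made-up name). Looping over p = (x + i) - 1 and q = (y + j) - 1 means each iteration owns exactly one slot of 𝛥N, so the outer loops could be split across threads safely, at the cost of the edge checks:

function grad_N_sketch(𝛥ℛ, K, N)
    𝛥N = zero(N)
    for q in axes(𝛥N, 2), p in axes(𝛥N, 1)   # one write per (p, q): safe to parallelise
        acc = zero(eltype(𝛥N))
        for j in axes(K, 2), i in axes(K, 1)
            x = p - i + 1                     # solve p = (x + i) - 1 for x
            y = q - j + 1
            # near the edges x or y falls outside 𝛥ℛ; skipping those is like dealing with padding
            if x in axes(𝛥ℛ, 1) && y in axes(𝛥ℛ, 2)
                acc += 𝛥ℛ[x, y] * conj(K[i, j])
            end
        end
        𝛥N[p, q] = acc
    end
    return 𝛥N
end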