Issues with offset indices and GPU
Trying to implement a slightly weird "half-convolution" as follows:
using NNlib, Tullio
_tullio_conv(u, v) = @tullio y[i+_] := u[i+a] * v[a]
function tullio_conv(u::AbstractVector, v::AbstractVector)
    # Left pad
    upad = NNlib.pad_zeros(u, (length(v) - 1, 0))
    v_rev = @view v[end:-1:1]
    return _tullio_conv(upad, v_rev)
end
which results in "KernelAbstractions can't handle OffsetArrays here" when running tullio_conv(a, b) with a and b being CuArrays of the same size. It works fine on the CPU, though.
Not sure if this is Tullio.jl's or KernelAbstractions.jl's "fault".
Perhaps both? When last I checked, KernelAbstractions.jl needed indices starting at 1 here, and Tullio.jl applies the shift in y[i+_] := on the LHS, while otherwise letting i do what it likes. The kernel looks like this:
┌ Info: =====KA===== KernelAbstractions kernel
│   verbosetidy(kex1) =
│    quote
│        KernelAbstractions.@kernel function var"##🇨🇺#263"(ℛ::AbstractArray{𝒯}, u, v, 𝛥i, 𝒶𝓍i, 𝒶𝓍a, ♻️, 💀) where 𝒯
│            (i,) = @index(Global, NTuple)
│            begin
│                𝒜𝒸𝒸 = if ♻️ === nothing
│                        zero(𝒯)
│                    else
│                        ℛ[i + 𝛥i]
│                    end
│                for a = 𝒶𝓍a
│                    𝒜𝒸𝒸 = 𝒜𝒸𝒸 + u[(i + a) - 1] * v[a]
│                end
│                ℛ[i + 𝛥i] = 𝒜𝒸𝒸
│            end
│        end
│    end
└ end
If it were smarter, it could move the 𝛥i to the RHS and adjust the range of i. You can also do this yourself, by writing @tullio y[i+_] := u[i+a-1] * v[a] verbose=true. (This prints out the index ranges; verbose=2 also prints the generated kernels, etc.)
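Applied to the original example, that manual shift might look something like this. A rough sketch which I haven't run on the GPU; the tullio_conv_shifted names are just for illustration:

# With i+a-1 on the RHS, i runs over 1:length(u), so the LHS shift _ is zero
_tullio_conv_shifted(u, v) = @tullio y[i+_] := u[i+a-1] * v[a]

function tullio_conv_shifted(u::AbstractVector, v::AbstractVector)
    upad = NNlib.pad_zeros(u, (length(v) - 1, 0))  # left pad, as before
    v_rev = @view v[end:-1:1]
    return _tullio_conv_shifted(upad, v_rev)
end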
There is some built-in handling for padding, BTW, although I'm not sure whether it works on the GPU. Reversing v should also just be a sign change in the indices, I think:
julia> u = 1:5; v = [-1,0,10];
julia> _tullio_conv(u,v)
3-element Vector{Int64}:
29
38
47
julia> tullio_conv(u,v)
5-element Vector{Int64}:
-1
-2
7
16
25
julia> @tullio y[i] := u[pad(i-a+1,2,0)] * v[a]
5-element Vector{Int64}:
-1
-2
7
16
25
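If you want that pad form as a function, note the pad widths are written as literals here, hard-coded to length(v) - 1 == 2 for this 3-element kernel. A sketch only (tullio_conv_pad3 is a made-up name, and again I haven't checked the GPU path):

# The minus sign on a replaces reversing v; pad(..., 2, 0) replaces the left zero-padding
tullio_conv_pad3(u, v) = @tullio y[i] := u[pad(i-a+1, 2, 0)] * v[a]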
Note also that if you need gradients, they may be wrong until #86 fixes a bug, and (sadly) may not work on the GPU after that.
Hi, is this the same issue this code has?
using Tullio
using CUDA, CUDAKernels, KernelAbstractions
N = cu(rand(Float32, 10,10))
K = cu(rand(Float32, 3,3))
@tullio M[x+_,y+_] := N[x+i, y+j] * K[i,j]
gives ERROR: LoadError: "KernelAbstractions can't handle OffsetArrays here"
Is there a way to work around this?
Yes, it's the same. The manual fix is to change the indices to have ranges starting from 1:
julia> @tullio M[x+_,y+_] := N[x+i, y+j] * K[i,j] verbose=true
┌ Info: left index ranges
│ x = 0:7
└ y = 0:7
ERROR: "KernelAbstractions can't handle OffsetArrays here"
julia> @tullio M[x,y] := N[x+i-1, y+j-1] * K[i,j] verbose=true
┌ Info: left index ranges
│ x = 1:8
└ y = 1:8
The generated code is pasted below. (x, y) = @index(Global, NTuple) is the "output" of ndrange = tuple(length(𝒶𝓍x), length(𝒶𝓍y)), and when last I checked it was a limitation of KernelAbstractions that these indices always start from 1.
As above, Tullio could be made to work around this, but I don't think it would be trivial. (In general it tries to treat OffsetArrays as first-class citizens, and allowing x to have any range it likes falls out of that viewpoint.)
┌ Info: =====KA===== KernelAbstractions kernel
│   verbosetidy(kex1) =
│    quote
│        KernelAbstractions.@kernel function var"##🇨🇺#278"(ℛ::AbstractArray{𝒯}, N, K, 𝛥x, 𝛥y, 𝒶𝓍x, 𝒶𝓍y, 𝒶𝓍i, 𝒶𝓍j, ♻️, 💀) where 𝒯
│            (x, y) = @index(Global, NTuple)
│            begin
│                𝒜𝒸𝒸 = if ♻️ === nothing
│                        zero(𝒯)
│                    else
│                        ℛ[x + 𝛥x, y + 𝛥y]
│                    end
│                for j = 𝒶𝓍j
│                    for i = 𝒶𝓍i
│                        𝒜𝒸𝒸 = 𝒜𝒸𝒸 + N[x + i, y + j] * K[i, j]
│                    end
│                end
│                ℛ[x + 𝛥x, y + 𝛥y] = 𝒜𝒸𝒸
│            end
│        end
│    end
└ end
┌ Info: =====KA===== KernelAbstractions CUDA actor
│   verbosetidy(kex2) =
│    quote
│        local @inline(function 𝒜𝒸𝓉!(::Type{<:CuArray}, ℛ::AbstractArray{𝒯}, N, K, 𝛥x, 𝛥y, 𝒶𝓍x, 𝒶𝓍y, 𝒶𝓍i, 𝒶𝓍j, ♻️ = nothing, 💀 = true) where 𝒯
│                @info "running KernelAbstractions + CUDA actor " maxlog = 3 _id = 0x590ade10bed79747
│                cu_kern! = var"##🇨🇺#278"(CUDADevice())
│                (first)(𝒶𝓍x) == 1 || (throw)("KernelAbstractions can't handle OffsetArrays here")
│                (first)(𝒶𝓍y) == 1 || (throw)("KernelAbstractions can't handle OffsetArrays here")
│                𝒜𝒸𝒸 = cu_kern!(ℛ::AbstractArray{𝒯}, N, K, 𝛥x, 𝛥y, 𝒶𝓍x, 𝒶𝓍y, 𝒶𝓍i, 𝒶𝓍j, ♻️, 💀; ndrange = tuple(length(𝒶𝓍x), length(𝒶𝓍y)), workgroupsize = nothing, dependencies = Event(CUDADevice()))
│                KernelAbstractions.wait(CUDADevice(), 𝒜𝒸𝒸)
│            end)
└ end
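Putting that together with your original setup, the working GPU version would look something like this (same arrays as in your snippet):

using Tullio
using CUDA, CUDAKernels, KernelAbstractions
N = cu(rand(Float32, 10, 10))
K = cu(rand(Float32, 3, 3))
# x and y now start at 1, so no OffsetArray shift is needed
@tullio M[x,y] := N[x+i-1, y+j-1] * K[i,j]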
I understand, it works with that. However, it now warns that "using KernelAbstractions with no outer indices, this will be slow". Is there a way to improve that?
This warning should be clearer that it's only about the gradient, but it seems like a tricky problem. If different parts of the range of x are done in parallel, then writing into 𝛥N[(x + i) - 1, (y + j) - 1] is unsafe, since threads working on nearby values of x can have values of i for which their writes collide.
The ultimate solution would presumably be to solve for x in terms of x_plus_i, and re-write the expression as a loop over the x_plus_i values appearing on the left (in the gradient), which would then be safe to divide among threads. But the range of i then stops being simple for pixels near the edges; it's like dealing with padding. (A rough sketch of this re-indexing follows the output below.)
julia> @tullio M[x,y] := N[x+i-1, y+j-1] * K[i,j] verbose=true
┌ Info: symbolic gradients
│ inbody =
│ 2-element Vector{Any}:
│ :(𝛥N[(x + i) - 1, (y + j) - 1] = 𝛥N[(x + i) - 1, (y + j) - 1] + 𝛥ℛ[x, y] * conj(K[i, j]))
└ :(𝛥K[i, j] = 𝛥K[i, j] + 𝛥ℛ[x, y] * conj(N[(x + i) - 1, (y + j) - 1]))
┌ Warning: using KernelAbstractions with no outer indices, this will be slow
└ @ Tullio ~/.julia/dev/Tullio/src/macro.jl:1135
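To make that re-indexing concrete, here is a rough hand-written sketch in plain Julia (not what Tullio generates; grad_N_sketch is a made-up name). Looping over p = (x + i) - 1 and q = (y + j) - 1 means each iteration owns exactly one slot of 𝛥N, so the outer loops could be split across threads safely, at the cost of the edge checks:

function grad_N_sketch(𝛥ℛ, K, N)
    𝛥N = zero(N)
    for q in axes(𝛥N, 2), p in axes(𝛥N, 1)   # one write per (p, q): safe to parallelise
        acc = zero(eltype(𝛥N))
        for j in axes(K, 2), i in axes(K, 1)
            x = p - i + 1                     # solve p = (x + i) - 1 for x
            y = q - j + 1
            # near the edges x or y falls outside 𝛥ℛ; skipping those is like dealing with padding
            if x in axes(𝛥ℛ, 1) && y in axes(𝛥ℛ, 2)
                acc += 𝛥ℛ[x, y] * conj(K[i, j])
            end
        end
        𝛥N[p, q] = acc
    end
    return 𝛥N
end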