`gather` is not friendly with matrix of size 0 on GPU
using NNlib, CUDA
julia> NNlib.gather(rand(0,32),[2,3,4]) #on CPU
0×3 Matrix{Float64}
julia> a = rand(0,32) |> cu
0×32 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}
julia> idx = cu[2,4,6]
3-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
2
4
6
julia> NNlib.gather(a,idx)
ERROR: DivideError: integer division error
Stacktrace:
[1] div
@ .\int.jl:284 [inlined]
[2] div
@ .\div.jl:257 [inlined]
[3] div
@ .\div.jl:312 [inlined]
[4] cld
@ .\div.jl:269 [inlined]
[5] gather!(dst::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, src::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, idx::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
@ NNlibCUDA C:\Users\Luffy\.julia\packages\NNlibCUDA\vECff\src\gather.jl:62
[6] gather(src::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, idx::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
@ NNlib C:\Users\Luffy\.julia\packages\NNlib\hydo3\src\gather.jl:77
[7] top-level scope
@ REPL[135]:1
[8] top-level scope
@ C:\Users\Luffy\.julia\packages\CUDA\qAl31\src\initialization.jl:52
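For context, the DivideError seems to come from the launch-configuration arithmetic rather than the kernel body: with an empty `dst` the computed thread count is zero, so a `cld(length(dst), threads)`-style expression ends up dividing by zero. Assuming that is indeed the failing arithmetic, the error reproduces on its own as:
julia> cld(0, 0)   # ceiling division with a zero divisor, as in the launch-size computation
ERROR: DivideError: integer division error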
Looks like gather allocates an empty array of the right size:
https://github.com/FluxML/NNlib.jl/blob/master/src/gather.jl#L76
so this can probably be fixed by adding a short-circuit like `isempty(dst) && return dst` in `gather!`, before it launches kernels?
https://github.com/FluxML/NNlibCUDA.jl/blob/master/src/gather.jl#L52
Should be an easy PR, if you're interested. Would want tests in this file:
https://github.com/FluxML/NNlibCUDA.jl/blob/master/test/gather.jl
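Roughly, the idea is the following (a standalone sketch for illustration; `gather_empty_safe` is a hypothetical wrapper, not the actual NNlibCUDA method, though the destination size matches what `NNlib.gather` allocates for scalar indices):
using NNlib, NNlibCUDA, CUDA
# Hypothetical wrapper: allocate the destination the same way gather does,
# then skip the kernel launch entirely when there is nothing to copy.
function gather_empty_safe(src::CuArray, idx::CuArray{<:Integer})
    dst = CUDA.zeros(eltype(src), size(src)[1:end-1]..., size(idx)...)
    isempty(dst) && return dst            # short-circuit: no kernel launch, no divide error
    return NNlib.gather!(dst, src, idx)   # normal path: dispatches to the CUDA kernel
end
gather_empty_safe(CUDA.rand(0, 32), cu([2, 4, 6]))   # 0×3 CuArray, no error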
> so this can probably be fixed by adding a short-circuit like `isempty(dst) && return dst` in `gather!`, before it launches kernels?
We need to check `max(index) <= size(src)[end-M:end]`, i.e. that every index is within the trailing dimensions of `src`. This check is missing in general when the index is on the GPU. Or we can simply fall back to the CPU index when `src` is empty, something like
if size(src, 1) == 0
    dst = NNlib.gather(src, cpu(idx_gpu))   # empty src: cpu(...) from Flux moves the index back to the host
else
    dst = NNlib.gather(src, idx_gpu)
end
This works fine but is not ideal.
Maybe it's a good idea to remove the `@inbounds` macro in
https://github.com/FluxML/NNlibCUDA.jl/blob/fb6fe8efa4764e989d4a328232433ca0fde129bd/src/gather.jl#L32
and
https://github.com/FluxML/NNlibCUDA.jl/blob/fb6fe8efa4764e989d4a328232433ca0fde129bd/src/gather.jl#L43
After doing this, we get the desired error info:
10×1 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
ERROR: Out-of-bounds array access.
ERROR: Out-of-bounds array access.
ERROR: Out-of-bounds array access.
ERROR: Out-of-bounds array access.
ERROR: Out-of-bounds array access.
ERROR: Out-of-bounds array access.
ERROR: Out-of-bounds array access.
ERROR: Out-of-bounds array access.
ERROR: Out-of-bounds array access.
ERROR: Out-of-bounds array access.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
Error showing value of type CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
ERROR: KernelException: exception thrown during kernel execution on device NVIDIA GeForce RTX 3070 Laptop GPU
Stacktrace:
@yuehhua
It's not a good idea to remove `@inbounds` in a GPU kernel. The dimensions of a CuArray should be checked outside the kernel, so that the kernel itself can work properly. I agree with the idea from @mcabbott.
So the problem is that we need to add a bounds-checking function here, and make it compatible with empty arrays.
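For illustration, such a check could look roughly like this (the name `check_gather_bounds` and the error text are hypothetical, and this only covers the scalar-index case):
# Hypothetical host-side check, run before the kernel launch.
function check_gather_bounds(dst, src, idx)
    isempty(dst) && return nothing        # empty result: nothing to check, no kernel launch needed
    lo, hi = minimum(idx), maximum(idx)   # GPU reductions; no scalar indexing on the device array
    (1 <= lo && hi <= size(src)[end]) ||
        throw(ArgumentError("gather index range $lo:$hi is out of bounds for source of size $(size(src))"))
    return nothing
end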
Originally, gather was not designed to accept an empty array. The CPU case is a coincidence. If it is intended to be compatible with the GPU as well, returning an empty array is reasonable. If throwing an error is expected instead, just check whether an empty array was received and throw the error; you don't need to deal with the bounds check. The empty array input is the root cause, and the indexing failure is a derived issue.
> The CPU case is a coincidence
It is a coincidence we should avoid. Using NNlib alone does not call into NNlibCUDA:
using NNlib, CUDA
src = CUDA.rand(2,3)
NNlib.gather(src,cu[1,4])
ERROR: BoundsError: attempt to access 2×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer} at index [1:2, 4]
Stacktrace:
[1] throw_boundserror(A::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, I::Tuple{Base.Slice{Base.OneTo{Int64}}, Int64})
@ Base .\abstractarray.jl:691
[2] checkbounds
@ .\abstractarray.jl:656 [inlined]
[3] view
@ C:\Users\Luffy\.julia\packages\CUDA\qAl31\src\array.jl:617 [inlined]
[4] _view(X::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, colons::Tuple{Colon}, k::Int64)
@ NNlib C:\Users\Luffy\.julia\packages\NNlib\hydo3\src\scatter.jl:38
[5] gather!(dst::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, src::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, idx::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
@ NNlib C:\Users\Luffy\.julia\packages\NNlib\hydo3\src\gather.jl:27
[6] gather(src::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, idx::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
@ NNlib C:\Users\Luffy\.julia\packages\NNlib\hydo3\src\gather.jl:77
[7] top-level scope
@ c:\Users\Luffy\gather_test.jl:5
unless we write
using NNlib, CUDA
using NNlibCUDA
src = CUDA.rand(2,3)
NNlib.gather(src,cu[1,4])
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.430532 0.0
0.474528 0.0
So now we also have the bounds-checking problem: with NNlibCUDA loaded, the out-of-bounds index is silently accepted and zeros are returned instead of an error. This causes problems downstream, as in https://github.com/CarloLucibello/GraphNeuralNetworks.jl/issues/181. At this point it looks more like a bug.
> the indexing failure is a derived issue
As you can see above, bounds checking is one issue and the empty array is another; you need to deal with both.
Oh! Now I get your point.
The so-called coincidence may be a third issue. For instance, if I'm using Flux.jl, which imports NNlibCUDA, I have no way of knowing that I should always put the index on the GPU. I could very easily write something like
NNlib.gather(src, [2,3,4])
It won't throw an error, since under the hood we are calling NNlib.gather!, not NNlibCUDA.gather!. Assuming everything is fine with NNlibCUDA.gather!, then either we should automatically move the index to the GPU when NNlibCUDA.gather! is in the namespace (it should always be there, since we already know src is on the GPU), or it should throw an error like "src is on GPU, but idx is on CPU".
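Roughly, either option could be expressed as an extra method along these lines (the signatures are only illustrative, not NNlibCUDA's actual definitions):
using NNlib, CUDA
# Option 1 (hypothetical): silently move a CPU index to the GPU.
NNlib.gather(src::CuArray, idx::Array{<:Integer}) = NNlib.gather(src, cu(idx))
# Option 2 (hypothetical, mutually exclusive with option 1): refuse the mixed case with a clear error.
# NNlib.gather(src::CuArray, idx::Array{<:Integer}) =
#     throw(ArgumentError("src is on GPU, but idx is on CPU; move the index with cu(idx)"))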
For these points, you can file corresponding issues.
#416 and #415