`gather` is not friendly with matrix of size 0 on GPU
using NNlib, CUDA
julia> NNlib.gather(rand(0,32),[2,3,4]) #on CPU
0×3 Matrix{Float64}
julia> a = rand(0,32) |> cu
0×32 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}
julia> idx = cu[2,4,6]
3-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
2
4
6
julia> NNlib.gather(a,idx)
ERROR: DivideError: integer division error
Stacktrace:
[1] div
@ .\int.jl:284 [inlined]
[2] div
@ .\div.jl:257 [inlined]
[3] div
@ .\div.jl:312 [inlined]
[4] cld
@ .\div.jl:269 [inlined]
[5] gather!(dst::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, src::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, idx::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
@ NNlibCUDA C:\Users\Luffy\.julia\packages\NNlibCUDA\vECff\src\gather.jl:62
[6] gather(src::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, idx::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
@ NNlib C:\Users\Luffy\.julia\packages\NNlib\hydo3\src\gather.jl:77
[7] top-level scope
@ REPL[135]:1
[8] top-level scope
@ C:\Users\Luffy\.julia\packages\CUDA\qAl31\src\initialization.jl:52
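For context, the DivideError seems to come from the launch-configuration arithmetic rather than the kernel body: with an empty `dst` the computed thread count is zero, so a `cld(length(dst), threads)`-style expression ends up dividing by zero. Assuming that is indeed the failing arithmetic, the error reproduces on its own as:
julia> cld(0, 0)   # ceiling division with a zero divisor, as in the launch-size computation
ERROR: DivideError: integer division error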
Looks like gather allocates an empty array of the right size:
https://github.com/FluxML/NNlib.jl/blob/master/src/gather.jl#L76
so this can probably be fixed by adding a short-circuit like `isempty(dst) && return dst` in `gather!`, before it launches kernels?
https://github.com/FluxML/NNlibCUDA.jl/blob/master/src/gather.jl#L52
Should be an easy PR, if you're interested. Would want tests in this file:
https://github.com/FluxML/NNlibCUDA.jl/blob/master/test/gather.jl
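Roughly, the idea is the following (a standalone sketch for illustration; `gather_empty_safe` is a hypothetical wrapper, not the actual NNlibCUDA method, though the destination size matches what `NNlib.gather` allocates for scalar indices):
using NNlib, NNlibCUDA, CUDA
# Hypothetical wrapper: allocate the destination the same way gather does,
# then skip the kernel launch entirely when there is nothing to copy.
function gather_empty_safe(src::CuArray, idx::CuArray{<:Integer})
    dst = CUDA.zeros(eltype(src), size(src)[1:end-1]..., size(idx)...)
    isempty(dst) && return dst            # short-circuit: no kernel launch, no divide error
    return NNlib.gather!(dst, src, idx)   # normal path: dispatches to the CUDA kernel
end
gather_empty_safe(CUDA.rand(0, 32), cu([2, 4, 6]))   # 0×3 CuArray, no error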
> so this can probably be fixed by adding a short-circuit like `isempty(dst) && return dst` in `gather!`, before it launches kernels?
We need to check `max(index) <= size(src)[end-M:end]`, i.e. that every index is within the trailing dimensions of `src`. This check is missing in general when the index is on the GPU. Or we can simply fall back to the CPU index when `src` is empty, something like
if size(src, 1) == 0
    dst = NNlib.gather(src, cpu(idx_gpu))   # empty src: cpu(...) from Flux moves the index back to the host
else
    dst = NNlib.gather(src, idx_gpu)
end
This works fine but is not ideal.
Maybe it's a good idea to remove the `@inbounds` macro in
https://github.com/FluxML/NNlibCUDA.jl/blob/fb6fe8efa4764e989d4a328232433ca0fde129bd/src/gather.jl#L32
and
https://github.com/FluxML/NNlibCUDA.jl/blob/fb6fe8efa4764e989d4a328232433ca0fde129bd/src/gather.jl#L43
After doing this, we get the desired error info:
10×1 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
ERROR: Out-of-bounds array access.
ERROR: Out-of-bounds array access.
ERROR: Out-of-bounds array access.
ERROR: Out-of-bounds array access.
ERROR: Out-of-bounds array access.
ERROR: Out-of-bounds array access.
ERROR: Out-of-bounds array access.
ERROR: Out-of-bounds array access.
ERROR: Out-of-bounds array access.
ERROR: Out-of-bounds array access.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
Error showing value of type CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
ERROR: KernelException: exception thrown during kernel execution on device NVIDIA GeForce RTX 3070 Laptop GPU
Stacktrace:
@yuehhua
It's not a good idea to remove `@inbounds` in a GPU kernel. The dimensions of a CuArray should be checked outside the kernel, so that the kernel itself can work properly. I agree with the idea from @mcabbott.
So the problem is that we need to add a bounds-checking function here, and make it compatible with empty arrays.
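For illustration, such a check could look roughly like this (the name `check_gather_bounds` and the error text are hypothetical, and this only covers the scalar-index case):
# Hypothetical host-side check, run before the kernel launch.
function check_gather_bounds(dst, src, idx)
    isempty(dst) && return nothing        # empty result: nothing to check, no kernel launch needed
    lo, hi = minimum(idx), maximum(idx)   # GPU reductions; no scalar indexing on the device array
    (1 <= lo && hi <= size(src)[end]) ||
        throw(ArgumentError("gather index range $lo:$hi is out of bounds for source of size $(size(src))"))
    return nothing
end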
Originally, gather was not designed to accept an empty array. The CPU case is a coincidence. If it is intended to be compatible with the GPU as well, returning an empty array is reasonable. If throwing an error is expected instead, just check whether an empty array was received and throw the error; you don't need to deal with the bounds check. The empty array input is the root cause, and the indexing failure is a derived issue.
> The CPU case is a coincidence
It is a coincidence we should avoid. Using NNlib alone does not call into NNlibCUDA:
using NNlib, CUDA
src = CUDA.rand(2,3)
NNlib.gather(src,cu[1,4])
ERROR: BoundsError: attempt to access 2×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer} at index [1:2, 4]
Stacktrace:
[1] throw_boundserror(A::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, I::Tuple{Base.Slice{Base.OneTo{Int64}}, Int64})
@ Base .\abstractarray.jl:691
[2] checkbounds
@ .\abstractarray.jl:656 [inlined]
[3] view
@ C:\Users\Luffy\.julia\packages\CUDA\qAl31\src\array.jl:617 [inlined]
[4] _view(X::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, colons::Tuple{Colon}, k::Int64)
@ NNlib C:\Users\Luffy\.julia\packages\NNlib\hydo3\src\scatter.jl:38
[5] gather!(dst::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, src::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, idx::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
@ NNlib C:\Users\Luffy\.julia\packages\NNlib\hydo3\src\gather.jl:27
[6] gather(src::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, idx::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
@ NNlib C:\Users\Luffy\.julia\packages\NNlib\hydo3\src\gather.jl:77
[7] top-level scope
@ c:\Users\Luffy\gather_test.jl:5
unless we write
using NNlib, CUDA
using NNlibCUDA
src = CUDA.rand(2,3)
NNlib.gather(src,cu[1,4])
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.430532 0.0
0.474528 0.0
So now we also have the bounds-checking problem: with NNlibCUDA loaded, the out-of-bounds index is silently accepted and zeros are returned instead of an error. This causes problems downstream, as in https://github.com/CarloLucibello/GraphNeuralNetworks.jl/issues/181. At this point it looks more like a bug.
> the indexing failure is a derived issue
As you can see above, bounds checking is one issue and the empty array is another; you need to deal with both.
Oh! Now I get your point.
The so-called coincidence may be a third issue. For instance, if I'm using Flux.jl, which imports NNlibCUDA, I have no way of knowing that I should always put the index on the GPU. I could very easily write something like
NNlib.gather(src, [2,3,4])
It won't throw an error, since under the hood we are calling NNlib.gather!, not NNlibCUDA.gather!. Assuming everything is fine with NNlibCUDA.gather!, then either we should automatically move the index to the GPU when NNlibCUDA.gather! is in the namespace (it should always be there, since we already know src is on the GPU), or it should throw an error like "src is on GPU, but idx is on CPU".
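Roughly, either option could be expressed as an extra method along these lines (the signatures are only illustrative, not NNlibCUDA's actual definitions):
using NNlib, CUDA
# Option 1 (hypothetical): silently move a CPU index to the GPU.
NNlib.gather(src::CuArray, idx::Array{<:Integer}) = NNlib.gather(src, cu(idx))
# Option 2 (hypothetical, mutually exclusive with option 1): refuse the mixed case with a clear error.
# NNlib.gather(src::CuArray, idx::Array{<:Integer}) =
#     throw(ArgumentError("src is on GPU, but idx is on CPU; move the index with cu(idx)"))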
For these points, you can file corresponding issues.
#416 and #415