NNlib.jl
Reshape of a view breaks batched_mul on GPU (scalar indexing happens)
Useful Info:
NNlib v0.8.16
I ran into this while implementing an efficient multi-head attention layer (reshaping views of a larger array, then using batched_mul). A minimal example to reproduce:
using Flux # to get batched_mul
using CUDA
CUDA.allowscalar(false)
A3 = rand(3, 3, 8) |> gpu
v3 = rand(4, 2, 3, 8) |> gpu
v4, v5 = eachslice(v3, dims=2)   # v4 is view(v3, :, 1, :, :)
v6 = reshape(v4, size(v4)...)    # a do-nothing reshape wrapping that view
v4 ⊠ A3 # works
v6 ⊠ A3 # breaks, throwing a scalar indexing error
Error for reference:
Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore are only permitted from the REPL for prototyping purposes.
If you did intend to index this array, annotate the caller with @allowscalar.
error(s::String) at error.jl:33
assertscalar(op::String) at GPUArraysCore.jl:100
getindex(::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::Int64, ::Int64, ::Int64, ::Int64) at indexing.jl:9
getindex at subarray.jl:276 [inlined]
_unsafe_getindex_rs at reshapedarray.jl:249 [inlined]
_unsafe_getindex at reshapedarray.jl:246 [inlined]
getindex at reshapedarray.jl:234 [inlined]
getindex at subarray.jl:276 [inlined]
_generic_matmatmul!(C::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, tA::Char, tB::Char, A::SubArray{Float32, 2, Base.ReshapedArray{Float32, 3, SubArray{Float32, 3, CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, false}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Int64}, false}, B::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, _add::LinearAlgebra.MulAddMul{true, true, Float32, Float32}) at matmul.jl:830
generic_matmatmul!(C::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, tA::Char, tB::Char, A::SubArray{Float32, 2, Base.ReshapedArray{Float32, 3, SubArray{Float32, 3, CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, false}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Int64}, false}, B::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, _add::LinearAlgebra.MulAddMul{true, true, Float32, Float32}) at matmul.jl:798
mul! at matmul.jl:302 [inlined]
batched_mul_generic!(C::CUDA.CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}, A::Base.ReshapedArray{Float32, 3, SubArray{Float32, 3, CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, false}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}}, B::CUDA.CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}, α::Float32, β::Float32) at batchedmul.jl:274
_batched_try_gemm! at batchedmul.jl:219 [inlined]
_batched_mul! at batchedmul.jl:211 [inlined]
batched_mul! at batchedmul.jl:205 [inlined]
batched_mul! at batchedmul.jl:205 [inlined]
_batched_mul at batchedmul.jl:61 [inlined]
batched_mul(A::Base.ReshapedArray{Float32, 3, SubArray{Float32, 3, CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, false}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}}, B::CUDA.CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}) at batchedmul.jl:48
top-level scope at model.jl:135
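A stopgap that I'd expect to work here (at the cost of one extra copy) is to index the slice rather than view it, so that the reshape wraps a plain CuArray; this is only a workaround, not a fix:
v4c = v3[:, 1, :, :]             # getindex copies into a contiguous CuArray
v6c = reshape(v4c, size(v4c)...) # reshape of a plain CuArray keeps a BLAS-friendly layout
v6c ⊠ A3                         # should dispatch to the CUBLAS batched path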
I think it's taking the generic path here:
https://github.com/FluxML/NNlib.jl/blob/7f6ea504bc1aa9c2a0567d6bf2da6f1fdcca6352/src/batched/batchedmul.jl#L219
because is_strided says this array is not strided:
julia> summary(v6)
"4×3×8 reshape(view(::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, :, 1, :, :), 4, 3, 8) with eltype Float32"
julia> NNlib.is_strided(v6) # NNlib's test for whether to use BLAS
false
julia> strides(v6) # maybe this is OK?
(1, 8, 24)
julia> strides(copy(v6))
(1, 4, 12)
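Those strides (1, 8, 24) do describe v6's real layout within the parent buffer (the dense copy has (1, 4, 12)), so in principle a strided batched gemm could consume it. Purely as a sketch of what a widened test would have to check, something like the following; could_use_strided_batched is a made-up name, not anything NNlib or CUDA defines:
# Made-up helper, only to illustrate the condition a widened test could check.
function could_use_strided_batched(A::AbstractArray)
    st = try
        strides(A)              # throws when the reshape admits no fixed strides
    catch
        return false
    end
    root = A
    while parent(root) !== root # peel off ReshapedArray / SubArray wrappers
        root = parent(root)
    end
    root isa CuArray && st[1] == 1  # gemm needs unit stride along dim 1
end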
That test is intended to match what CUBLAS accepts here:
https://github.com/FluxML/NNlibCUDA.jl/blob/master/src/batchedmul.jl#L3-L4
and in this case, it doesn't seem to accept v6:
julia> CUDA.CUBLAS.gemm_strided_batched!('N', 'N', true, v4, A3, false, v4 ⊠ A3) |> summary
"4×3×8 CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}"
julia> CUDA.CUBLAS.gemm_strided_batched!('N', 'N', true, v6, A3, false, v4 ⊠ A3) |> summary
ERROR: conversion to pointer not defined for Base.ReshapedArray{Float32, 3, SubArray{Float32, 3, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, false}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}}
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:35
[2] unsafe_convert(#unused#::Type{CuPtr{Float32}}, a::Base.ReshapedArray{Float32, 3, SubArray{Float32, 3, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}}, false}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}})
@ CUDA ~/.julia/packages/CUDA/Ey3w2/src/pointer.jl:64
It might be possible to widen this... a very quick hack seems to run:
julia> Base.unsafe_convert(::Type{CuPtr{Float32}}, a::typeof(v6)) = Base.unsafe_convert(CuPtr{Float32}, parent(parent(a)))
julia> CUDA.CUBLAS.gemm_strided_batched!('N', 'N', true, v6, A3, false, v4 ⊠ A3) |> summary
"4×3×8 CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}"
But I have not checked correctness, and in reality the method would need to be restricted to reshapes which are in fact sufficiently simple (ones which still describe a regularly strided block of memory). After updating CUDA to allow this, the test used by NNlib could be widened to match.
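To be a bit more concrete, a sketch of that widening (everything below is untested, an assumption about what the method would need to do) could compute the device pointer from the root CuArray plus the view's offset, restricted to the reshape-of-SubArray-of-CuArray case:
# Sketch only: a real method must additionally verify that the reshape
# still has a strided layout (e.g. that strides(a) is defined).
function Base.unsafe_convert(::Type{CuPtr{T}},
        a::Base.ReshapedArray{T, N, <:SubArray{T, M, <:CuArray{T}}}) where {T, N, M}
    v = parent(a)                                      # the SubArray
    root = parent(v)                                   # the underlying CuArray
    offset = LinearIndices(root)[first.(parentindices(v))...] - 1
    return Base.unsafe_convert(CuPtr{T}, root) + offset * sizeof(T)
end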
I think CUDA.jl is a little gun-shy about accommodating too many wrappers after https://github.com/JuliaGPU/CUDA.jl/issues/453. Worth a shot though.
This multi-wrapper case is a little easier, in that creating a pointer must either succeed or fail. CUDA certainly owns unsafe_convert(::Type{CuPtr{Float32}}, a::Any).
The worse case for multiple wrappers is where Base has fun(A::AbstractArray, B::AbstractArray), and trying to make dispatch catch things like fun(A::ReshapedArray{... CuArray}, B::Adjoint{..., CuArray}) leads to an explosion of methods. NNlib's dispatch on storage_type here is an attempt to avoid that (a possibly too-complicated one).
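Roughly, the storage_type trick is to strip wrappers until you reach the underlying array type, so that the CUDA backend can add one method keyed on that type instead of one per wrapper combination. A simplified sketch of the pattern (not NNlib's exact code):
# Peel off wrappers until parent() is a no-op, then report that type.
unwrap_storage_type(A::AbstractArray) =
    parent(A) === A ? typeof(A) : unwrap_storage_type(parent(A))

unwrap_storage_type(v6)  # CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}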