NNlib.jl
Maxpool misbehaving in some edge cases
I am using NNlibCUDA's maxpool to calculate a sliding window maximum (I know there may be other/better ways of doing it). Unfortunately it fails catastrophically in some interesting cases. I will attach an MWE where I use an (8, 3, 1, 1) CuArray and a (5, 3) kernel, but in reality I use a (320001, 32) CuArray and a (2049, 3) kernel. I do not see the same behaviour when using NNlib with native arrays.
using CUDA
using NNlib
using NNlibCUDA

N = (8, 3, 1, 1)
K = (5, 3)
x = rand(N...)
x_c = CUDA.rand(N...)

# "same"-style padding (k ÷ 2 per side) with stride 1, so the output
# covers every input element
nnlib = maxpool(x, K; pad=Tuple(k ÷ 2 for k ∈ K), stride=(1, 1))
nnlib_cuda = maxpool(x_c, K; pad=Tuple(k ÷ 2 for k ∈ K), stride=(1, 1))

# with stride 1, a sliding-window maximum must preserve the global maximum
@assert maximum(nnlib) == maximum(x)
@assert maximum(nnlib_cuda) == maximum(x_c)
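The invariant those two asserts rely on can be checked with a small CPU-only sketch (written in Python here so it runs without a GPU; `sliding_max` is a hypothetical helper, not part of NNlib): with stride 1, every input element is covered by some window, so the global maximum of the pooled output must equal the global maximum of the input as long as the padding value can never win.

```python
# CPU-only sketch of the invariant the MWE's asserts check: a stride-1
# sliding-window maximum cannot lose the global maximum, because every
# element lies inside at least one window. Padding with -inf plays the
# role of max-pool padding, which should never beat real data.
import math
import random

def sliding_max(x, k):
    """1-D sliding-window maximum, stride 1, pad k // 2 on each side."""
    pad = [-math.inf] * (k // 2)
    xp = pad + list(x) + pad
    return [max(xp[i:i + k]) for i in range(len(xp) - k + 1)]

random.seed(0)
x = [random.random() for _ in range(8)]
y = sliding_max(x, 5)

assert len(y) == len(x)      # "same" padding keeps the length
assert max(y) == max(x)      # the property the @assert lines test
```

So if `maximum(nnlib_cuda) != maximum(x_c)` under these settings, the GPU path has dropped data somewhere.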
From Project.toml:
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
NNlib = "872c559c-99b0-510c-b3b7-b6c96a88d5cd"
NNlibCUDA = "a00861dc-f156-4864-bf3c-e6376f28a68d"
Please let me know if you need any further information.
I'm not able to replicate this. Can you post the output of CUDA.versioninfo() as well as ] st (Pkg status)? I would also try creating a fresh environment with only CUDA, NNlib and NNlibCUDA to see if that makes a difference.
I couldn't replicate it either (NNlib v0.8.14, NNlibCUDA v0.2.5):
julia> CUDA.versioninfo()
CUDA toolkit 11.7, artifact installation
NVIDIA driver 515.65.1, for CUDA 11.7
CUDA driver 11.7
Libraries:
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.2
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: 11.0.0+515.65.1
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)
Toolchain:
- Julia: 1.8.4
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86
2 devices:
0: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.293 GiB / 11.000 GiB available)
1: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.670 GiB / 11.000 GiB available)
I managed to reproduce with the larger sizes mentioned (the output contains mostly zeros when it shouldn't). If someone can figure out what we're passing to https://github.com/JuliaGPU/CUDA.jl/blob/v3.12.1/lib/cudnn/pooling.jl and whether any of those parameters look incorrect, that would help immensely with fixing this bug.
Can reproduce. The maxima don't differ much (only in the last digit, and not always), but the zeros in the output are reliably wrong at this size, though not at much smaller sizes:
julia> begin
K2 = (300, 1)
N = (300_000, 32, 1, 1)
x_c = CUDA.rand(N...)
nnlib_cuda = maxpool(x_c, K2; stride=1) # slightly simplified
maximum(nnlib_cuda) == maximum(x_c)
end
false
julia> maximum(x_c) => maximum(nnlib_cuda)
0.9999999f0 => 0.9999995f0
julia> count(iszero, x_c) => count(iszero, nnlib_cuda)
0 => 8388608
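For what it's worth, the shape arithmetic makes that zero count look suspicious (this is a hedged observation, not a diagnosis): with window (300, 1), stride 1 and no padding on a (300_000, 32, 1, 1) array, the output has (300000 - 300 + 1) × 32 = 9,590,432 elements, and the 8,388,608 reported zeros are exactly 2^23 of them, which could point at some 32-bit size or launch-configuration limit being hit.

```python
# Output-size arithmetic for the failing case above.
# Window (300, 1), stride 1, no padding, input (300_000, 32, 1, 1):
out_elems = (300_000 - 300 + 1) * (32 - 1 + 1)
print(out_elems)              # total output elements: 9_590_432
print(8_388_608 == 2**23)     # the reported zero count is exactly 2^23
print(8_388_608 / out_elems)  # fraction of the output that is zero
```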
julia> device()
CuDevice(0): Tesla V100-PCIE-16GB
(@v1.10) pkg> st CUDA
Status `~/.julia/environments/v1.10/Project.toml`
[052768ef] CUDA v3.12.1