NNlib.jl
`relu` propagates NaN on CPU but not on GPU
Edit: see https://github.com/FluxML/NNlib.jl/issues/509; this is due to `relu`, not really `BatchNorm`.
julia> using Flux
julia> layer = BatchNorm(32, relu)
BatchNorm(32, relu) # 64 parameters, plus 64 non-trainable
julia> layer(NaN32*zeros(Float32, (32,1)))
32×1 Matrix{Float32}:
NaN
NaN
NaN
NaN
NaN
NaN
NaN
⋮
NaN
NaN
NaN
NaN
NaN
NaN
julia> gpu(layer)(gpu(NaN32*zeros(Float32, (32,1))))
32×1 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
⋮
0.0
0.0
0.0
0.0
0.0
0.0
Edit: I just saw I'm on Flux v0.12.10, so maybe this is outdated.
Can confirm the same behaviour on Flux v0.13.6 on the GPU
Just wanted to add that if no activation function is used, you get NaNs as expected.
gpu(BatchNorm(32))(gpu(NaN32*zeros(Float32, (32,1))))
32×1 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
NaN
NaN
NaN
NaN
NaN
NaN
NaN
⋮
NaN
NaN
NaN
NaN
NaN
NaN
julia> relu.([NaN32 NaN32])
1×2 Matrix{Float32}:
NaN NaN
julia> relu.(gpu([NaN NaN]))
1×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.0 0.0
Probably this is the culprit: https://github.com/FluxML/NNlibCUDA.jl/blob/838699761a67572417e84ae78f1398b6860ec585/src/cudnn/activations.jl#L14
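For context, the CPU side propagates NaN because NNlib defines `relu` via the two-argument `max` (roughly `relu(x) = max(x, zero(x))`; treat the exact definition as my reading of the source), and Julia's `max` propagates NaN, unlike cuDNN's default non-propagating mode. A minimal CPU sketch:

```julia
# Sketch (my reading of NNlib's source, not a verbatim copy): the CPU relu is
# essentially the two-argument max, and Julia's max propagates NaN.
relu_like(x) = max(x, zero(x))

relu_like(NaN32)   # NaN, matching the CPU results above
relu_like(-1f0)    # 0.0f0
```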
julia> using NNlib, CUDA
julia> relu.([NaN32 NaN32])
1×2 Matrix{Float32}:
NaN NaN
julia> relu.(cu([NaN32 NaN32]))
1×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
NaN NaN
julia> sigmoid.(cu([NaN32 NaN32]))
1×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
NaN NaN
julia> tanh.(cu([NaN32 NaN32]))
1×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
NaN NaN
julia> using NNlibCUDA
julia> relu.(cu([NaN32 NaN32])) # failure
1×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.0 0.0
julia> relu.(cu([NaN32 NaN32]).+1) # only happens when broadcasting directly on an array
1×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
NaN NaN
julia> sigmoid.(cu([NaN32 NaN32])) # similar rule for other functions doesn't hurt
1×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
NaN NaN
julia> tanh.(cu([NaN32 NaN32]))
1×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
NaN NaN
Following that trail, I get to https://github.com/JuliaGPU/CUDA.jl/blob/0c5bd736f91877c3dfac1d08af3448bd08733d00/lib/cudnn/activation.jl#L32 (upstream docs: https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnNanPropagation_t), where it looks like the default is not to propagate NaNs. Do we want to use another option for Flux? I don't really know the pros and cons of NaN propagation on the GPU, but naively it seems better to me if GPU behavior matches CPU behavior.
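Until that's settled, a user-side workaround is easy to sketch (`relu_nanprop` is a hypothetical helper name, not an NNlib API); `ifelse` evaluates both arms without branching, so it should still fuse into a single GPU broadcast kernel:

```julia
# Hypothetical workaround: a relu variant that explicitly propagates NaN.
# ifelse keeps the expression branch-free, so it stays broadcast/GPU-friendly.
relu_nanprop(x) = ifelse(isnan(x), x, max(x, zero(x)))

relu_nanprop(NaN32)  # NaN
relu_nanprop(-2f0)   # 0.0f0
```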
Is that path noticeably faster than using CUDA.jl's default broadcast? If not, perhaps we should drop it completely.
To be clear, I don't have a GPU to test this on. If someone wouldn't mind doing that and reporting back, we can proceed :)
Bumping this since it came up again in the issue transfer. The offer is open for anyone to do the benchmarking required. If we're close enough to cuDNN here, those custom overloads can just be deleted.
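For anyone with a GPU willing to pick this up, a rough sketch of the comparison (assumes CUDA.jl, NNlib, and BenchmarkTools are installed; the guard just makes it a no-op on machines without a working GPU):

```julia
using CUDA, NNlib, BenchmarkTools

if CUDA.functional()
    x = CUDA.rand(Float32, 1024, 1024)
    # Baseline: CUDA.jl's generic broadcast kernel.
    @btime CUDA.@sync relu.($x)
    # Loading NNlibCUDA reroutes relu.(x) through the cuDNN path;
    # rerun the same timing line after `using NNlibCUDA` to compare.
end
```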