NNlib.jl
`relu` propagates NaN on CPU but not on GPU
Edit: see https://github.com/FluxML/NNlib.jl/issues/509; this is due to `relu`, not really `BatchNorm`.
julia> using Flux
julia> layer = BatchNorm(32, relu)
BatchNorm(32, relu) # 64 parameters, plus 64 non-trainable
julia> layer(NaN32*zeros(Float32, (32,1)))
32×1 Matrix{Float32}:
NaN
NaN
NaN
NaN
NaN
NaN
NaN
⋮
NaN
NaN
NaN
NaN
NaN
NaN
julia> gpu(layer)(gpu(NaN32*zeros(Float32, (32,1))))
32×1 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
⋮
0.0
0.0
0.0
0.0
0.0
0.0
Edit: I just saw I'm on Flux v0.12.10, so maybe this is outdated.
Can confirm the same behaviour on Flux v0.13.6 on the GPU
Just wanted to add that if no activation function is used, you get NaNs as expected.
gpu(BatchNorm(32))(gpu(NaN32*zeros(Float32, (32,1))))
32×1 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
NaN
NaN
NaN
NaN
NaN
NaN
NaN
⋮
NaN
NaN
NaN
NaN
NaN
NaN
julia> relu.([NaN32 NaN32])
1×2 Matrix{Float32}:
NaN NaN
julia> relu.(gpu([NaN NaN]))
1×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.0 0.0
Probably this is the culprit: https://github.com/FluxML/NNlibCUDA.jl/blob/838699761a67572417e84ae78f1398b6860ec585/src/cudnn/activations.jl#L14
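For context, the CPU side propagates NaN because NNlib defines `relu` via the two-argument `max` (roughly `relu(x) = max(x, zero(x))`; treat the exact definition as my reading of the source), and Julia's `max` propagates NaN, unlike cuDNN's default non-propagating mode. A minimal CPU sketch:

```julia
# Sketch (my reading of NNlib's source, not a verbatim copy): the CPU relu is
# essentially the two-argument max, and Julia's max propagates NaN.
relu_like(x) = max(x, zero(x))

relu_like(NaN32)   # NaN, matching the CPU results above
relu_like(-1f0)    # 0.0f0
```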
julia> using NNlib, CUDA
julia> relu.([NaN32 NaN32])
1×2 Matrix{Float32}:
NaN NaN
julia> relu.(cu([NaN32 NaN32]))
1×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
NaN NaN
julia> sigmoid.(cu([NaN32 NaN32]))
1×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
NaN NaN
julia> tanh.(cu([NaN32 NaN32]))
1×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
NaN NaN
julia> using NNlibCUDA
julia> relu.(cu([NaN32 NaN32])) # failure
1×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.0 0.0
julia> relu.(cu([NaN32 NaN32]).+1) # only happens when broadcasting directly on an array
1×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
NaN NaN
julia> sigmoid.(cu([NaN32 NaN32])) # similar rule for other functions doesn't hurt
1×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
NaN NaN
julia> tanh.(cu([NaN32 NaN32]))
1×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
NaN NaN
Following that trail, I get to https://github.com/JuliaGPU/CUDA.jl/blob/0c5bd736f91877c3dfac1d08af3448bd08733d00/lib/cudnn/activation.jl#L32 (upstream docs: https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnNanPropagation_t), where it looks like the default is not to propagate NaNs. Do we want to use another option for Flux? I don't really know the pros and cons of NaN propagation on the GPU, but naively it seems better to me if GPU behavior matches CPU behavior.
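Until that's settled, a user-side workaround is easy to sketch (`relu_nanprop` is a hypothetical helper name, not an NNlib API); `ifelse` evaluates both arms without branching, so it should still fuse into a single GPU broadcast kernel:

```julia
# Hypothetical workaround: a relu variant that explicitly propagates NaN.
# ifelse keeps the expression branch-free, so it stays broadcast/GPU-friendly.
relu_nanprop(x) = ifelse(isnan(x), x, max(x, zero(x)))

relu_nanprop(NaN32)  # NaN
relu_nanprop(-2f0)   # 0.0f0
```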
Is that path noticeably faster than using CUDA.jl's default broadcast? If not, perhaps we should drop it completely.
To be clear, I don't have a GPU to test this on. If someone wouldn't mind doing that and reporting back, we can proceed :)
Bumping this since it came up again in the issue transfer. The offer is open for anyone to do the benchmarking required. If we're close enough to cuDNN here, those custom overloads can just be deleted.
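For anyone with a GPU willing to pick this up, a rough sketch of the comparison (assumes CUDA.jl, NNlib, and BenchmarkTools are installed; the guard just makes it a no-op on machines without a working GPU):

```julia
using CUDA, NNlib, BenchmarkTools

if CUDA.functional()
    x = CUDA.rand(Float32, 1024, 1024)
    # Baseline: CUDA.jl's generic broadcast kernel.
    @btime CUDA.@sync relu.($x)
    # Loading NNlibCUDA reroutes relu.(x) through the cuDNN path;
    # rerun the same timing line after `using NNlibCUDA` to compare.
end
```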