
Error when using Float64: ERROR: UndefRefError: access to undefined reference

contradict opened this issue 5 years ago • 14 comments

Distilled from this discussion. I couldn't remove anything further and still reproduce the crash.

Julia 1.4.1, Flux 0.10.4

using Flux

function mwe(T)
    int1 = Dense(4, 280)
    resd(X) = reshape(int1(X), 10, 7, 4, :)
    tc1 = ConvTranspose((4, 3), 4 => 4, relu, stride = (2, 2), pad = 1)
    mdl = Chain(resd, tc1)
    z = [1, 2, 3, 4]
    X̂ = mdl(z)
    X = randn(T, size(X̂)...)
    loss(y) = -sum(Flux.binarycrossentropy.(mdl(z), y))
    ps = Flux.params(mdl)
    gs = gradient(ps) do
        loss(X)
    end
end

julia> mwe(Float32); # success

julia> mwe(Float64)

ERROR: UndefRefError: access to undefined reference
Stacktrace:
 [1] getindex at ./array.jl:789 [inlined]
 [2] conv_direct!(::Array{AbstractFloat,5}, ::Array{AbstractFloat,5}, ::Array{Float32,5}, ::NNlib.DenseConvDims{3,(4, 3, 1),4,4,(2, 2, 1),(1, 1, 1, 1, 0, 0),(1, 1, 1),false}; alpha::Float64, beta::Bool) at /home/russel/.julia/packages/NNlib/FAI3o/src/impl/conv_direct.jl:98
 [3] conv_direct! at /home/russel/.julia/packages/NNlib/FAI3o/src/impl/conv_direct.jl:51 [inlined]
 [4] conv!(::Array{AbstractFloat,5}, ::Array{AbstractFloat,5}, ::Array{Float32,5}, ::NNlib.DenseConvDims{3,(4, 3, 1),4,4,(2, 2, 1),(1, 1, 1, 1, 0, 0),(1, 1, 1),false}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/russel/.julia/packages/NNlib/FAI3o/src/conv.jl:99
 [5] conv!(::Array{AbstractFloat,5}, ::Array{AbstractFloat,5}, ::Array{Float32,5}, ::NNlib.DenseConvDims{3,(4, 3, 1),4,4,(2, 2, 1),(1, 1, 1, 1, 0, 0),(1, 1, 1),false}) at /home/russel/.julia/packages/NNlib/FAI3o/src/conv.jl:97
 [6] conv!(::Array{AbstractFloat,4}, ::Array{AbstractFloat,4}, ::Array{Float32,4}, ::NNlib.DenseConvDims{2,(4, 3),4,4,(2, 2),(1, 1, 1, 1),(1, 1),false}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/russel/.julia/packages/NNlib/FAI3o/src/conv.jl:70
 [7] conv! at /home/russel/.julia/packages/NNlib/FAI3o/src/conv.jl:70 [inlined]
 [8] conv(::Array{AbstractFloat,4}, ::Array{Float32,4}, ::NNlib.DenseConvDims{2,(4, 3),4,4,(2, 2),(1, 1, 1, 1),(1, 1),false}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/russel/.julia/packages/NNlib/FAI3o/src/conv.jl:116
 [9] conv at /home/russel/.julia/packages/NNlib/FAI3o/src/conv.jl:114 [inlined]
 [10] FluxML/Flux.jl#1837 at /home/russel/.julia/packages/Zygote/YeCEW/src/lib/nnlib.jl:41 [inlined]
 [11] (::Zygote.var"#4556#back#1839"{Zygote.var"#1837#1838"{Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},Array{Float32,4},Array{Float32,4},NNlib.DenseConvDims{2,(4, 3),4,4,(2, 2),(1, 1, 1, 1),(1, 1),false}}})(::Array{AbstractFloat,4}) at /home/russel/.julia/packages/ZygoteRules/6nssF/src/adjoint.jl:49
 [12] ConvTranspose at /home/russel/.julia/packages/Flux/Fj3bt/src/layers/conv.jl:148 [inlined]
 [13] (::typeof(∂(λ)))(::Array{Float64,4}) at /home/russel/.julia/packages/Zygote/YeCEW/src/compiler/interface2.jl:0
 [14] applychain at /home/russel/.julia/packages/Flux/Fj3bt/src/layers/basic.jl:36 [inlined]
 [15] (::typeof(∂(applychain)))(::Array{Float64,4}) at /home/russel/.julia/packages/Zygote/YeCEW/src/compiler/interface2.jl:0
 [16] applychain at /home/russel/.julia/packages/Flux/Fj3bt/src/layers/basic.jl:36 [inlined]
 [17] (::typeof(∂(applychain)))(::Array{Float64,4}) at /home/russel/.julia/packages/Zygote/YeCEW/src/compiler/interface2.jl:0
 [18] Chain at /home/russel/.julia/packages/Flux/Fj3bt/src/layers/basic.jl:38 [inlined]
 [19] (::typeof(∂(λ)))(::Array{Float64,4}) at /home/russel/.julia/packages/Zygote/YeCEW/src/compiler/interface2.jl:0
 [20] loss at /home/russel/Desktop/MWE.jl:13 [inlined]
 [21] (::typeof(∂(λ)))(::Float64) at /home/russel/.julia/packages/Zygote/YeCEW/src/compiler/interface2.jl:0
 [22] FluxML/Flux.jl#9 at /home/russel/Desktop/MWE.jl:16 [inlined]
 [23] (::typeof(∂(λ)))(::Float64) at /home/russel/.julia/packages/Zygote/YeCEW/src/compiler/interface2.jl:0
 [24] (::Zygote.var"#49#50"{Zygote.Params,Zygote.Context,typeof(∂(λ))})(::Float64) at /home/russel/.julia/packages/Zygote/YeCEW/src/compiler/interface.jl:179
 [25] gradient(::Function, ::Zygote.Params) at /home/russel/.julia/packages/Zygote/YeCEW/src/compiler/interface.jl:55
 [26] mwe(::Type{T} where T) at /home/russel/Desktop/MWE.jl:15
 [27] top-level scope at REPL[5]:1
 [28] eval(::Module, ::Any) at ./boot.jl:331
 [29] eval_user_input(::Any, ::REPL.REPLBackend) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/REPL/src/REPL.jl:86
 [30] run_backend(::REPL.REPLBackend) at /home/russel/.julia/packages/Revise/MgvIv/src/Revise.jl:1023
 [31] top-level scope at none:0

contradict avatar May 22 '20 03:05 contradict

Perhaps this needs a loosening of the signature in NNlib.

DhairyaLGandhi avatar Jun 05 '20 09:06 DhairyaLGandhi

I've been experiencing a similar problem as raised here by @contradict (UndefRefError: access to undefined reference with a convolutional VAE). Is this issue still under consideration @DhairyaLGandhi ?

alecokas avatar Jun 30 '20 23:06 alecokas

Although the error here is weird and should be fixed, I'm not sure we want to support mixed Float32/Float64 computations; we should at least throw a warning. For example, all layers in the mwe function should be converted to Float64 using the f64 method when T == Float64.
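That conversion might look like the following sketch, which assumes the f64 helper Flux provides for promoting a model's parameters:

```julia
using Flux

# Sketch: Flux.f64 converts all of a model's parameters to Float64, so the
# element types of the weights and the data agree before taking gradients.
mdl = Chain(Dense(4, 8, relu), Dense(8, 1)) |> Flux.f64

all(p -> eltype(p) == Float64, Flux.params(mdl))   # every parameter promoted
```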

CarloLucibello avatar Jul 01 '20 10:07 CarloLucibello

That sounds reasonable, @CarloLucibello. I've spent a couple of days trying to identify the root of the error so I could create a fix, but have not been successful yet. Even when running the accepted solution in the linked discussion, I still encounter typing issues while trying to backprop through the pooling layer.

alecokas avatar Jul 03 '20 08:07 alecokas

I have run into this problem in a different way. The main culprit seems to be in NNlib, at conv.jl:89

y = similar(x, promote_type(xT, wT), output_size(cdims)...,
                               channels_out(cdims), size(x,N))

and conv_direct.jl:98, 141

y[w_idx, h_idx, d_idx, c_out, batch] = alpha*dotprod + beta*y[w_idx, h_idx, d_idx, c_out, batch]

The problem arises when similar creates an array of #undef values and conv_direct! tries to read them. Even though beta defaults to false, merely evaluating y[...] raises an exception.
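This failure mode can be reproduced in isolation (a sketch of my own, not NNlib code): with a non-bits element type such as the AbstractFloat appearing in the stack trace, similar allocates #undef slots, and any read before a write throws.

```julia
# `similar` with a non-bits element type (here the abstract AbstractFloat
# from the stack trace) allocates an array whose slots are #undef references.
x = zeros(Float32, 3, 3)
y = similar(x, AbstractFloat, 3, 3)
@assert !isassigned(y, 1)            # every slot is #undef

# Reading a slot -- even inside `alpha*d + beta*y[i]` with beta == false --
# throws, because the reference itself is undefined.
err = try
    y[1, 1]
catch e
    e
end
@assert err isa UndefRefError
```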

My quick workaround is changing the behavior of beta to select between the value of y[...] and 0. Alternatively, the initializer for the given datatype should be fixed so it won't produce #undef values.

y[w_idx, h_idx, d_idx, c_out, batch] = alpha*dotprod + (beta ? y[w_idx, h_idx, d_idx, c_out, batch] : 0)

(I am not an expert in Julia, so I am not sure which behavior should change.)

zomborid avatar Oct 08 '21 14:10 zomborid

That seems reasonable, but similar shouldn't produce #undef values. We have fallbacks for mixed f32/f64 as well, which warn appropriately.

DhairyaLGandhi avatar Oct 08 '21 14:10 DhairyaLGandhi

I think it is not about f32/f64 specifically, but some internal type handling during the gradient call. I don't know how gradient works internally, but I would guess it passes a special type through similar that causes #undef initialization. In my case, similar(zeros(3,3), Num, 2,2) produces #undef values.

zomborid avatar Oct 08 '21 15:10 zomborid

FWIW the original MWE works without issue for me on Flux 0.12.7 and Zygote 0.6.23. We'll need a new one to keep investigating this.

ToucheSir avatar Oct 08 '21 16:10 ToucheSir

@zomborid, beta in conv_direct! is actually not a boolean parameter; it is assigned a Bool value as a performance optimization. See https://github.com/FluxML/NNlib.jl/blob/c30ea9bf9d024adfeb99bf10fb8a1e91368ca8ea/src/impl/conv_direct.jl#L37-L40.
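To illustrate that optimization (my sketch, not NNlib code): false is a "strong zero" in Julia, so beta = false makes alpha*d + beta*y[i] numerically equal to alpha*d even for non-finite y[i] — but the expression still reads y[i], which is exactly what throws on an #undef slot.

```julia
# `false` is a strong zero: multiplying by it yields zero even for
# non-finite values, which is why `beta = false` safely discards the
# accumulation term numerically.
@assert false * Inf == 0.0
@assert false * NaN == 0.0

# However, `alpha*d + beta*y[i]` still *evaluates* y[i], so an #undef
# slot raises UndefRefError before the multiplication can discard it.
```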

ToucheSir avatar Oct 08 '21 16:10 ToucheSir

For my case I finally decided on the workaround of defining similar in the following way:

Base.similar(a::AbstractArray, ::Type{MySpecificType}, dims::Base.DimOrInd...) = fill(MySpecificType(), dims)

That said, I think reading a possibly undefined value is a bug in Flux. Perhaps specialising the existing performance trick to primitive types, and otherwise using a properly initialized array, would be better.

zomborid avatar Oct 11 '21 10:10 zomborid

Wait, are you defining a completely custom numeric type? I'm surprised it works at all, then, though it is a fair argument that the direct conv implementation — optimized for simplicity — should be able to handle unknown numeric types as long as they adhere to some interface.

ToucheSir avatar Oct 11 '21 15:10 ToucheSir

The original MWE in the OP works for me on 1.7-rc1.

Since the extended issue seems to arise from similar, this can occur with any mutable user-defined type. similar is even documented to return an uninitialized array:

help?> similar
search: similar

  similar(array, [element_type=eltype(array)], [dims=size(array)])

  Create an uninitialized mutable array with the given element type and size, based upon the given
  source array.

If Flux requires this array to be zeroed, I'd suggest using either zeros(T, dims...) or explicitly zeroing the resulting array (presumably the better choice, since that preserves the array type). Using fill(zero(T), dims...) would make every entry of the resulting array ===, since fill doesn't copy its argument for mutable types T (see https://github.com/JuliaLang/julia/issues/41209).
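The aliasing pitfall is easy to see with a small mutable type (the Box struct here is a hypothetical example of my own):

```julia
# `fill` stores the *same* object in every slot; it does not copy.
mutable struct Box
    v::Float64
end

a = fill(Box(0.0), 3)
a[1].v = 1.0

@assert a[1] === a[2] === a[3]   # all three slots alias one Box
@assert a[2].v == 1.0            # mutating "one entry" mutates them all
```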

Seelengrab avatar Oct 16 '21 20:10 Seelengrab

The MWE runs fine on Julia 1.8.5 with Flux 0.13.15.

natema avatar Apr 25 '23 17:04 natema

So the fix is basically what @Seelengrab mentioned: add a loop at the start of the direct conv functions which zeros out the destination array. The original routine was written cleverly, but without regard for more exotic numeric types. Because the functions in question are fallback methods I would not consider this super high priority, but happy to guide someone through a PR if there's interest.
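A minimal sketch of that fix (the name zero_out! is mine, not NNlib's): zero the destination up front so no similar-allocated slot is ever read while still #undef.

```julia
# Hypothetical pre-pass for the direct conv fallbacks: initialize every
# slot of the destination before the main accumulation loop runs.
function zero_out!(y::AbstractArray{T}) where {T}
    @inbounds for i in eachindex(y)
        y[i] = zero(T)
    end
    return y
end

y = similar(zeros(Float32, 2, 2), AbstractFloat, 2, 2)   # slots are #undef
zero_out!(y)
@assert y[1, 1] == 0    # now safe to read and accumulate into
```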

ToucheSir avatar Apr 25 '23 19:04 ToucheSir