NNlib.jl
Error when using Float64: ERROR: UndefRefError: access to undefined reference
Distilled from this discussion. I couldn't remove anything further and still reproduce the crash.
Julia 1.4.1, Flux 0.10.4
using Flux
function mwe(T)
    int1 = Dense(4, 280)
    resd(X) = reshape(int1(X), 10, 7, 4, :)
    tc1 = ConvTranspose((4, 3), 4 => 4, relu, stride = (2, 2), pad = 1)
    mdl = Chain(resd, tc1)
    z = [1, 2, 3, 4]
    X̂ = mdl(z)
    X = randn(T, size(X̂)...)
    loss(y) = -sum(Flux.binarycrossentropy.(mdl(z), y))
    ps = Flux.params(mdl)
    gs = gradient(ps) do
        loss(X)
    end
end
julia> mwe(Float32); # success
julia> mwe(Float64)
ERROR: UndefRefError: access to undefined reference
Stacktrace:
[1] getindex at ./array.jl:789 [inlined]
[2] conv_direct!(::Array{AbstractFloat,5}, ::Array{AbstractFloat,5}, ::Array{Float32,5}, ::NNlib.DenseConvDims{3,(4, 3, 1),4,4,(2, 2, 1),(1, 1, 1, 1, 0, 0),(1, 1, 1),false}; alpha::Float64, beta::Bool) at /home/russel/.julia/packages/NNlib/FAI3o/src/impl/conv_direct.jl:98
[3] conv_direct! at /home/russel/.julia/packages/NNlib/FAI3o/src/impl/conv_direct.jl:51 [inlined]
[4] conv!(::Array{AbstractFloat,5}, ::Array{AbstractFloat,5}, ::Array{Float32,5}, ::NNlib.DenseConvDims{3,(4, 3, 1),4,4,(2, 2, 1),(1, 1, 1, 1, 0, 0),(1, 1, 1),false}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/russel/.julia/packages/NNlib/FAI3o/src/conv.jl:99
[5] conv!(::Array{AbstractFloat,5}, ::Array{AbstractFloat,5}, ::Array{Float32,5}, ::NNlib.DenseConvDims{3,(4, 3, 1),4,4,(2, 2, 1),(1, 1, 1, 1, 0, 0),(1, 1, 1),false}) at /home/russel/.julia/packages/NNlib/FAI3o/src/conv.jl:97
[6] conv!(::Array{AbstractFloat,4}, ::Array{AbstractFloat,4}, ::Array{Float32,4}, ::NNlib.DenseConvDims{2,(4, 3),4,4,(2, 2),(1, 1, 1, 1),(1, 1),false}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/russel/.julia/packages/NNlib/FAI3o/src/conv.jl:70
[7] conv! at /home/russel/.julia/packages/NNlib/FAI3o/src/conv.jl:70 [inlined]
[8] conv(::Array{AbstractFloat,4}, ::Array{Float32,4}, ::NNlib.DenseConvDims{2,(4, 3),4,4,(2, 2),(1, 1, 1, 1),(1, 1),false}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/russel/.julia/packages/NNlib/FAI3o/src/conv.jl:116
[9] conv at /home/russel/.julia/packages/NNlib/FAI3o/src/conv.jl:114 [inlined]
[10] #1837 at /home/russel/.julia/packages/Zygote/YeCEW/src/lib/nnlib.jl:41 [inlined]
[11] (::Zygote.var"#4556#back#1839"{Zygote.var"#1837#1838"{Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},Array{Float32,4},Array{Float32,4},NNlib.DenseConvDims{2,(4, 3),4,4,(2, 2),(1, 1, 1, 1),(1, 1),false}}})(::Array{AbstractFloat,4}) at /home/russel/.julia/packages/ZygoteRules/6nssF/src/adjoint.jl:49
[12] ConvTranspose at /home/russel/.julia/packages/Flux/Fj3bt/src/layers/conv.jl:148 [inlined]
[13] (::typeof(∂(λ)))(::Array{Float64,4}) at /home/russel/.julia/packages/Zygote/YeCEW/src/compiler/interface2.jl:0
[14] applychain at /home/russel/.julia/packages/Flux/Fj3bt/src/layers/basic.jl:36 [inlined]
[15] (::typeof(∂(applychain)))(::Array{Float64,4}) at /home/russel/.julia/packages/Zygote/YeCEW/src/compiler/interface2.jl:0
[16] applychain at /home/russel/.julia/packages/Flux/Fj3bt/src/layers/basic.jl:36 [inlined]
[17] (::typeof(∂(applychain)))(::Array{Float64,4}) at /home/russel/.julia/packages/Zygote/YeCEW/src/compiler/interface2.jl:0
[18] Chain at /home/russel/.julia/packages/Flux/Fj3bt/src/layers/basic.jl:38 [inlined]
[19] (::typeof(∂(λ)))(::Array{Float64,4}) at /home/russel/.julia/packages/Zygote/YeCEW/src/compiler/interface2.jl:0
[20] loss at /home/russel/Desktop/MWE.jl:13 [inlined]
[21] (::typeof(∂(λ)))(::Float64) at /home/russel/.julia/packages/Zygote/YeCEW/src/compiler/interface2.jl:0
[22] #9 at /home/russel/Desktop/MWE.jl:16 [inlined]
[23] (::typeof(∂(λ)))(::Float64) at /home/russel/.julia/packages/Zygote/YeCEW/src/compiler/interface2.jl:0
[24] (::Zygote.var"#49#50"{Zygote.Params,Zygote.Context,typeof(∂(λ))})(::Float64) at /home/russel/.julia/packages/Zygote/YeCEW/src/compiler/interface.jl:179
[25] gradient(::Function, ::Zygote.Params) at /home/russel/.julia/packages/Zygote/YeCEW/src/compiler/interface.jl:55
[26] mwe(::Type{T} where T) at /home/russel/Desktop/MWE.jl:15
[27] top-level scope at REPL[5]:1
[28] eval(::Module, ::Any) at ./boot.jl:331
[29] eval_user_input(::Any, ::REPL.REPLBackend) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/REPL/src/REPL.jl:86
[30] run_backend(::REPL.REPLBackend) at /home/russel/.julia/packages/Revise/MgvIv/src/Revise.jl:1023
[31] top-level scope at none:0
Perhaps this needs a loosening of the signature in NNlib.
I've been experiencing a problem similar to the one raised here by @contradict (UndefRefError: access to undefined reference with a convolutional VAE). Is this issue still under consideration, @DhairyaLGandhi?
Although the error here is weird and should be fixed, I'm not sure we want to support mixed Float32/Float64 computations; we should at least throw a warning. For example, all layers in the mwe function should be converted to Float64 using the f64 method when T == Float64.
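For reference, a minimal sketch of that conversion (assuming Flux's f64 helper; the layer and sizes here are illustrative, not taken from the MWE):
using Flux
m = Dense(4, 280)            # parameters are Float32 by default
m64 = Flux.f64(m)            # recursively convert the parameters to Float64
x = randn(Float64, 4, 10)
y = m64(x)                   # eltype(y) == Float64; no mixed-precision path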
That sounds reasonable, @CarloLucibello. I've spent a couple of days trying to identify the root of the error to create a fix, but have not been successful yet. Even when running the accepted solution in the linked discussion, I still encounter typing issues while trying to backprop through the pooling layer.
I have run into this problem in a different way. The main culprit seems to be in NNlib. In conv.jl:89:
y = similar(x, promote_type(xT, wT), output_size(cdims)...,
channels_out(cdims), size(x,N))
and in conv_direct.jl:98 and 141:
y[w_idx, h_idx, d_idx, c_out, batch] = alpha*dotprod + beta*y[w_idx, h_idx, d_idx, c_out, batch]
The problem arises when similar creates an array of undef values and conv_direct! tries to read them.
Even though beta is false by default, merely asking for the value of y[...] results in an exception: for a non-isbits element type, the undef slots are unassigned references, and reading one throws UndefRefError.
My quick workaround is changing the behavior of beta to select either the value of y[...] or 0:
y[w_idx, h_idx, d_idx, c_out, batch] = alpha*dotprod + (beta ? y[w_idx, h_idx, d_idx, c_out, batch] : 0)
Alternatively, maybe the initializer for the given datatype should be fixed so it won't result in undef values.
(I am not an expert in Julia, so I am not sure which behavior should change.)
That seems reasonable, but similar shouldn't produce undef values. We have fallbacks for mixed Float32/Float64 as well, which warn appropriately.
I think it is not about Float32/Float64 specifically, but about some internal type handling during the gradient call.
I don't know how gradient works internally, but I would guess it passes a special type through similar that causes undef initialization.
In my case, similar(zeros(3, 3), Num, 2, 2) produces #undef values.
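For anyone reproducing this without a custom type, the same behavior shows up with any non-isbits element type from Base:
julia> a = similar(zeros(3, 3), BigFloat, 2, 2);

julia> isassigned(a, 1)
false

julia> a[1, 1]
ERROR: UndefRefError: access to undefined reference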
FWIW the original MWE works without issue for me on Flux 0.12.7 and Zygote 0.6.23. We'll need a new one to keep investigating this.
@zomborid beta in conv_direct! is actually not a boolean param, but is assigned a bool value as a performance optimization. See https://github.com/FluxML/NNlib.jl/blob/c30ea9bf9d024adfeb99bf10fb8a1e91368ca8ea/src/impl/conv_direct.jl#L37-L40.
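The trick works because Bool acts as a "strong zero" under multiplication in Julia: with beta = false, the accumulation term beta*y[...] is an exact zero even when the destination holds Inf or NaN, and no branch is needed:
julia> false * NaN
0.0

julia> false * Inf
0.0

The catch is that y[...] is still read before the multiplication, which is exactly what throws for unassigned references.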
In my case, I finally settled on the workaround of defining similar in the following way:
Base.similar(a::AbstractArray, ::Type{MySpecificType}, dims::Base.DimOrInd...) = fill(MySpecificType(), dims)
That said, I think reading a possibly undefined value is a bug in Flux. Maybe specialising the performance trick on primitive types, and using a properly initialized array in the general case, would be better.
Wait, are you defining a completely custom numeric type? I'm certainly surprised that it works at all then, though it is a fair argument that the direct conv implementation, optimized for simplicity as it is, should be able to handle unknown numeric types as long as they adhere to some interface.
The original MWE in the OP works for me on Julia 1.7-rc1.
Since the extended issue seems to arise from similar, this can occur with any mutable user-defined type. similar is even documented to return an uninitialized array:
help?> similar
search: similar
similar(array, [element_type=eltype(array)], [dims=size(array)])
Create an uninitialized mutable array with the given element type and size, based upon the given
source array.
If Flux requires this to be zeroed, I'd suggest using either zeros(T, dims...) or explicitly zeroing the resulting array (presumably the better choice, since that preserves the array type). Using fill(zero(T), dims...) would run into having each entry of the resulting array be ===, since fill doesn't copy for mutable types T (see https://github.com/JuliaLang/julia/issues/41209).
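A quick illustration of that aliasing pitfall:
julia> v = fill(BigFloat(0), 2);

julia> v[1] === v[2]   # fill does not copy, so both entries alias one object
true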
The MWE runs fine on Julia 1.8.5 with Flux 0.13.15.
So the fix is basically what @Seelengrab mentioned: add a loop at the start of the direct conv functions that zeros out the destination array. The original routine was written cleverly, but without regard for more exotic numeric types. Because the functions in question are fallback methods, I would not consider this super high priority, but I'm happy to guide someone through a PR if there's interest.
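For anyone picking this up, a minimal sketch of that zeroing pass (a hypothetical helper, not the actual NNlib patch):
# Ensure every slot of y is assigned before the main conv loops, so later
# reads of y[...] can never hit an #undef reference. For isbits eltypes the
# array memory is already fully readable, so the loop is skipped.
function zero_undef!(y::Array{yT}) where {yT}
    isbitstype(yT) && return y
    for i in eachindex(y)
        isassigned(y, i) || (y[i] = zero(yT))
    end
    return y
end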