
Remove Recurrent cell eltype restriction

Open rmsrosa opened this issue 4 years ago • 12 comments

Remove type restriction in the methods associated with recurrent cells RNN, LSTM, GRU and GRUv3.

Change the conversion of Nil to other types, so that Nil() gets converted to NaN16, NaN32, or NaN for the corresponding float types, and likewise for complex types with float real and imaginary parts. Conversion to Integer and Rational types errors, consistent with the errors raised when converting non-integer floats.
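
A minimal sketch of what these conversions could look like (not the actual PR diff; it assumes Flux's internal Nil <: Real singleton used by outputsize):

import Flux: Nil

Base.convert(::Type{Float16}, ::Nil) = NaN16
Base.convert(::Type{Float32}, ::Nil) = NaN32
Base.convert(::Type{Float64}, ::Nil) = NaN
Base.convert(::Type{Complex{T}}, ::Nil) where {T<:AbstractFloat} = Complex{T}(NaN, NaN)

# Integer and Rational conversions are left to error, mirroring convert(Int, NaN).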

This allows outputsize to work with recurrent cells as well.
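
A hypothetical usage example once the restriction is lifted (layer sizes made up), probing output sizes with Nil-valued inputs instead of real data:

using Flux

m = Chain(RNN(8, 3), Dense(3, 1))
Flux.outputsize(m, (8, 15))   # expected to return (1, 15) for a batch of 15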

Suitable tests were added.

This should fix #1565.

PR Checklist

  • [x] Tests are added
  • [ ] Entry in NEWS.md
  • [ ] Documentation, if applicable
  • [ ] API changes require approval from a committer (different from the author, if applicable)

rmsrosa avatar Oct 25 '21 19:10 rmsrosa

Looks quite good to me, thanks!

DhairyaLGandhi avatar Oct 25 '21 20:10 DhairyaLGandhi

Could we do a few quick tests to make sure that mismatched eltypes work in the RNN layers as expected on the GPU?

DhairyaLGandhi avatar Oct 25 '21 20:10 DhairyaLGandhi

Yeah, I didn't check any CUDA-related stuff. No knowledge of that.

How bad are the failing tests above?

rmsrosa avatar Oct 25 '21 21:10 rmsrosa

For the RNN tests, I'd first create a regular RNN layer, then pass it through Flux.f64 and run it with Float32 data.

And rinse and repeat for CUDA.
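
A rough sketch of that check (layer and batch sizes are made up; it assumes the eltype restriction is removed, so the mixed-eltype call promotes instead of erroring):

using Flux, Test
using CUDA

rnn = RNN(3, 5)            # Float32 parameters by default
rnn64 = Flux.f64(rnn)      # promote the parameters to Float64
x32 = rand(Float32, 3, 7)  # Float32 input, batch of 7

@test size(rnn64(x32)) == (5, 7)

# rinse and repeat on the GPU (note that gpu converts parameters back to Float32)
if CUDA.functional()
    rnn_gpu = rnn64 |> gpu
    x_gpu = gpu(x32)
    @test size(rnn_gpu(x_gpu)) == (5, 7)
end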

DhairyaLGandhi avatar Oct 25 '21 22:10 DhairyaLGandhi

This goes against #1483, so I don't think we should merge without a discussion. The implementation is good other than @CarloLucibello's suggestion though.

Are the errors from #1483 no longer an issue? What made them go away? Did something change on the Zygote side of things, @mcabbott?

darsnack avatar Oct 26 '21 00:10 darsnack

No idea, have never looked at #1483. When I try, I get different errors:

julia> Flux.train!(loss, Flux.params(m), x, ADAM())
ERROR: MethodError: no method matching (::Flux.RNNCell{typeof(tanh), Matrix{Float32}, Vector{Float32}, Matrix{Float32}})(::Matrix{Float32}, ::Float64)

julia> x32 = [rand(Float32, 5) for i in 1:2]
2-element Vector{Vector{Float32}}:
 [0.19781059, 0.095588446, 0.7885516, 0.20238525, 0.21201706]
 [0.85013026, 0.44129986, 0.82300264, 0.8385034, 0.3073842]

julia> Flux.train!(loss, Flux.params(m), x32, ADAM())
ERROR: MethodError: no method matching (::Flux.RNNCell{typeof(tanh), Matrix{Float32}, Vector{Float32}, Matrix{Float32}})(::Matrix{Float32}, ::Float32)
Closest candidates are:
  (::Flux.RNNCell{F, A, V, <:AbstractMatrix{T}})(::Any, ::Union{AbstractVector{T}, AbstractMatrix{T}, Flux.OneHotArray}) where {F, A, V, T} at ~/.julia/packages/Flux/ZnXxS/src/layers/recurrent.jl:103
Stacktrace:
  [1] macro expansion
    @ ~/.julia/packages/Zygote/rv6db/src/compiler/interface2.jl:0 [inlined]
  [2] _pullback(::Zygote.Context, ::Flux.RNNCell{typeof(tanh), Matrix{Float32}, Vector{Float32}, Matrix{Float32}}, ::Matrix{Float32}, ::Float32)
    @ Zygote ~/.julia/packages/Zygote/rv6db/src/compiler/interface2.jl:9
  [3] _pullback
    @ ~/.julia/packages/Flux/ZnXxS/src/layers/recurrent.jl:47 [inlined]

I see that Flux.Recur is mutable, and Zygote currently has issues with some mutable structs, and almost no tests... does Flux have decent tests for these?

mcabbott avatar Oct 26 '21 00:10 mcabbott

I don't think Flux.train! does the right thing at all when given batched sequence data, so the MWEs from #1483 definitely need to be tweaked.
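
For reference, a sketch of the kind of tweak being alluded to (sizes made up, the loss closes over the model): wrap the whole sequence in a single tuple so train! sees one datum, rather than calling the loss once per timestep:

using Flux

m = RNN(5, 2)
xs = [rand(Float32, 5, 4) for _ in 1:3]   # 3 timesteps, batch of 4, 5 features
ys = rand(Float32, 2, 4)

# run the full sequence, compare only the final output
loss(x, y) = Flux.Losses.mse(last(map(m, x)), y)

Flux.reset!(m)
Flux.train!(loss, Flux.params(m), [(xs, ys)], ADAM())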

ToucheSir avatar Oct 26 '21 00:10 ToucheSir

I'm afraid the proposed change ends up hiding an error that arises from inconsistent eltypes of the input (X) and the cell's initial state (state0) when computing gradients on the GPU. For example, if an RNN is fed Float64 data while the model's state0 is initialized with the default Float32:

using Flux
using Statistics: mean
using Random: seed!
using BenchmarkTools
using CUDA

#####################################
# RNN vanilla
#####################################
seed!(123)
feat = 8
h_size = 3
seq_len = 5
batch_size = 15

rnn = RNN(feat, h_size)

X = [rand(Float64, feat, batch_size) for i in 1:seq_len]
Y = rand(Float64, batch_size, seq_len) ./ 10

#### transfer to gpu ####
rnn_gpu = rnn |> gpu
X_gpu = CuArray{Float64}.(X)
Y_gpu = Float64.(Y)

θ = Flux.params(rnn)
θ_gpu = Flux.params(rnn_gpu)

function loss_cpu(x, y)
    # Flux.reset!(rnn)
    l = sum(Flux.stack(map(rnn, x), dims=2))  # edit: dims keyword
    return l
end

function loss_gpu(x, y)
    l = sum(Flux.stack(map(rnn_gpu, x), dims=2))  # edit
    return l
end

loss_cpu(X, Y)
loss_gpu(X_gpu, Y_gpu)

opt = ADAM(1e-3)
Flux.train!(loss_cpu, θ, [(X, Y)], opt)

opt_gpu = ADAM(1e-3)
Flux.train!(loss_gpu, θ_gpu, [(X_gpu, Y_gpu)], opt_gpu)

The CPU train! works fine, while the GPU one results in:

ERROR: LoadError: this intrinsic must be compiled to be called
Stacktrace:
  [1] macro expansion
    @ C:\Users\jerem\.julia\packages\Zygote\rv6db\src\compiler\interface2.jl:0 [inlined]
  [2] _pullback(::Zygote.Context, ::Core.IntrinsicFunction, ::String, ::Type{Int64}, ::Type{Tuple{Ptr{Int64}}}, ::Ptr{Int64})
...

If using Float32 instead: X_gpu = CuArray{Float32}.(X), there's no issue.

Also, under the current forced parametrization in the RNNCell forward-pass definition, the issue would be caught in advance through the mismatched types:

ERROR: LoadError: MethodError: no method matching (::Flux.RNNCell{typeof(tanh), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}})(::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer})
Closest candidates are:
  (::Flux.RNNCell{F, A, V, var"#s526"} where var"#s526"<:AbstractMatrix{T})(::Any, ::Union{AbstractMatrix{T}, AbstractVector{T}, Flux.OneHotArray}) where {F, A, V, T} at c:\github\Flux.jl\src\layers\recurrent.jl:104

A way to properly handle Float32 params with Float64 input could be to specify the init_state type:

rnn = RNN(feat, h_size, init_state=zeros)

which results in a valid model on the CPU. However, when moving to the GPU, the |> gpu approach converts all model parameters to Float32, which keeps the problem in place.

I guess the ideal scenario is for the GPU gradient calculation to handle the case where the state0 type differs from the input/hidden-state type, but I'm missing the underlying CUDA implications :\

jeremiedb avatar Nov 05 '21 07:11 jeremiedb

Eltype promotion is how Julia handles multiple code paths and is a necessary component in (say) working with complex data, duals, etc. We also have tools to manage eltypes consistently (f64, f32), and type stability is a pretty common pattern to follow too. In Conv we warn on mismatched eltypes but don't error, so this can be made consistent.
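
As an illustration only (a hypothetical helper, not existing Flux code), a conv-style check for recurrent cells could warn on a mismatch and let promotion proceed:

function warn_eltype_mismatch(cell, Wi::AbstractMatrix{T}, x::AbstractArray{S}) where {T, S}
    T === S || @warn "eltype mismatch: $(typeof(cell)) has $T weights but received $S input; promoting" maxlog=1
    return x
end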

DhairyaLGandhi avatar Nov 05 '21 08:11 DhairyaLGandhi

I think having at least a warning here is important. It's a more complex issue than the kind of promotion that occurs for convs, because of the type of the storage field.

I do think the better long-term solution is an API. Per Julia's promotion system, the expected state type should be the promotion of the weights and the input. Right now we "guess" the state type based on the weights alone.
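
A sketch of that promotion rule (hypothetical helper name, not a Flux API):

# the hidden-state eltype follows Julia's promotion of the weight and input eltypes
state_eltype(Wh::AbstractMatrix{TW}, x::AbstractArray{TX}) where {TW, TX} = promote_type(TW, TX)

state_eltype(rand(Float32, 3, 3), rand(Float64, 3, 4))   # Float64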

darsnack avatar Nov 05 '21 13:11 darsnack

Maybe there's a necessary compromise here, since if Recur's state isn't parametrized (in the struct definition), then the promotion happens successfully on the state, even on the GPU. However, Recur's output type then cannot be inferred, which has resulted in reported performance issues. But if the "this intrinsic must be compiled" error on the GPU can be resolved, it would result in the desired behavior, right? Would it be worth opening an issue in CUDA.jl?

As a side note, if one is aware of the current gotchas and wants to initialize a model with Float32 weights and Float64 input and state, an issue is that the gpu conversion moves everything to Float32, thus losing the required Float64 for the state:

rnn_cpu = RNN(feat, h_size, init=Flux.zeros32, initb=Flux.zeros32, init_state=zeros)
rnn_gpu = rnn_cpu |> gpu
julia> rnn_cpu.state
3×1 Matrix{Float64}:
 0.0
 0.0
 0.0

julia> rnn_gpu.state
3×1 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.0
 0.0
 0.0

Wouldn't it be desirable for gpu to preserve the types? If that were the case, the handling of different types for weights vs. input/state would at least be more convenient, as the above would result in valid code for the GPU.
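
One possible workaround sketch (gpu_keep_eltype is a made-up helper, not the Flux gpu function): move arrays with the CuArray constructor, which preserves eltypes, instead of cu, which downcasts Float64 to Float32:

using Flux, CUDA

gpu_keep_eltype(m) = Flux.fmap(x -> x isa AbstractArray ? CuArray(x) : x, m)

rnn_cpu = RNN(8, 3, init=Flux.zeros32, initb=Flux.zeros32, init_state=zeros)
rnn_gpu = gpu_keep_eltype(rnn_cpu)   # Float32 weights, Float64 state, both on the GPU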

jeremiedb avatar Nov 05 '21 18:11 jeremiedb

Wouldn't it be desirable for gpu to preserve the types?

I'm not sure about that last bit, since GPUs can have very different performance properties depending on precision. Either way, if we know ahead of time that gpu produces f32, we can much more easily predict the eltypes we will get. Note that the error raised successfully prevents the user from mixing eltypes on the GPU, where the performance differences can be vastly larger. Is it common for users to need different eltypes for different parameters? Most cases I have seen don't seem to mind f32 everywhere, and don't actually intend to use the full precision of f64. Relaxing the eltypes makes it much easier to handle mixed-precision cases without needing to hold on to too many specific kernels for different eltypes.

DhairyaLGandhi avatar Jan 17 '22 09:01 DhairyaLGandhi

superseded by #2234

rmsrosa avatar May 05 '23 14:05 rmsrosa