Turing.jl stderror() on optimize() result fails

(apologies for closing and reopening)

I noticed that stderror() (and by extension coeftable()) fails in this case with a DomainError:

using Turing
using SequentialSamplingModels
using Random
using StatsPlots
using StatsBase
using Optim

# Generate some data with known parameters
dist = LBA(ν=[3.0, 2.0], A=0.8, k=0.2, τ=0.3)
data = rand(dist, 100)

@model function model_lba(data; min_rt=minimum(data.rt))
    # Priors
    ν ~ filldist(Normal(0, 2), 2)
    A ~ truncated(Normal(0.8, 0.4), 0.0, Inf)
    k ~ truncated(Normal(0.2, 0.2), 0.0, Inf)
    τ ~ truncated(Beta(2, 4), 0.0, min_rt)

    # Likelihood
    data ~ LBA(; ν, A, k, τ)
end

m = model_lba(data; min_rt=minimum(data.rt))
mle = optimize(m, MLE())

stderror(mle)

ERROR: DomainError with -0.00041917649359655615:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
DomainError detected in the user `f` function. This occurs when the domain of a function is violated.
For example, `log(-1.0)` is undefined because `log` of a real number is defined to only output real
numbers, but `log` of a negative number is complex valued and therefore Julia throws a DomainError
by default. Cases to be aware of include:

* `log(x)`, `sqrt(x)`, `cbrt(x)`, etc. where `x<0`
* `x^y` for `x<0` floating point `y` (example: `(-1.0)^(1/2) == im`)

Within the context of SciML, this error can occur within the solver process even if the domain constraint
would not be violated in the solution due to adaptivity. For example, an ODE solver or optimization
routine may check a step at `new_u` which violates the domain constraint, and if violated reject the
step and use a smaller `dt`. However, the throwing of this error will have halted the solving process.

Thus the recommended fix is to replace this function with the equivalent ones from NaNMath.jl
(https://github.com/JuliaMath/NaNMath.jl) which returns a NaN instead of an error. The solver will then
effectively use the NaN within the error control routines to reject the out of bounds step. Additionally,
one could perform a domain transformation on the variables so that such an issue does not occur in the
definition of `f`.

For more information, check out the following FAQ page:
https://docs.sciml.ai/Optimization/stable/API/FAQ/#The-Solver-Seems-to-Violate-Constraints-During-the-Optimization,-Causing-DomainErrors,-What-Can-I-Do-About-That?

Stacktrace:
  [1] throw_complex_domainerror(f::Symbol, x::Float64)
    @ Base.Math .\math.jl:33
  [2] sqrt
    @ .\math.jl:677 [inlined]
  [3] _broadcast_getindex_evalf
    @ .\broadcast.jl:683 [inlined]
  [4] _broadcast_getindex
    @ .\broadcast.jl:656 [inlined]
  [5] getindex
    @ .\broadcast.jl:610 [inlined]
  [6] macro expansion
    @ .\broadcast.jl:974 [inlined]
  [7] macro expansion
    @ .\simdloop.jl:77 [inlined]
  [8] copyto!
    @ .\broadcast.jl:973 [inlined]
  [9] copyto!
    @ .\broadcast.jl:926 [inlined]
 [10] copy
    @ .\broadcast.jl:898 [inlined]
 [11] materialize(bc::Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{NamedArrays.NamedArray}, Nothing, typeof(sqrt), Tuple{NamedArrays.NamedVector{Float64, Vector{Float64}, Tuple{OrderedCollections.OrderedDict{Symbol, Int64}}}}})
    @ Base.Broadcast .\broadcast.jl:873
 [12] stderror(model::Turing.ModeResult{NamedArrays.NamedVector{Float64, Vector{Float64}, Tuple{OrderedCollections.OrderedDict{Symbol, Int64}}}, Optim.MultivariateOptimizationResults{LBFGS{Nothing, LineSearches.InitialStatic{Float64}, LineSearches.HagerZhang{Float64, Base.RefValue{Bool}}, Optim.var"#19#21"}, Vector{Float64}, Float64, Float64, Vector{OptimizationState{Float64, LBFGS{Nothing, LineSearches.InitialStatic{Float64}, LineSearches.HagerZhang{Float64, Base.RefValue{Bool}}, Optim.var"#19#21"}}}, Bool, NamedTuple{(:f_limit_reached, :g_limit_reached, :h_limit_reached, :time_limit, :callback, :f_increased), NTuple{6, Bool}}}, LogDensityFunction{DynamicPPL.TypedVarInfo{NamedTuple{(:ν, :A, :k, :τ), Tuple{DynamicPPL.Metadata{Dict{AbstractPPL.VarName{:ν, Setfield.IdentityLens}, Int64}, Vector{DistributionsAD.TuringScalMvNormal{Vector{Float64}, Float64}}, Vector{AbstractPPL.VarName{:ν, Setfield.IdentityLens}}, Vector{Float64}, Vector{Set{DynamicPPL.Selector}}}, DynamicPPL.Metadata{Dict{AbstractPPL.VarName{:A, Setfield.IdentityLens}, Int64}, Vector{Truncated{Normal{Float64}, Continuous, Float64, Float64, Float64}}, Vector{AbstractPPL.VarName{:A, Setfield.IdentityLens}}, Vector{Float64}, Vector{Set{DynamicPPL.Selector}}}, DynamicPPL.Metadata{Dict{AbstractPPL.VarName{:k, Setfield.IdentityLens}, Int64}, Vector{Truncated{Normal{Float64}, Continuous, Float64, Float64, Float64}}, Vector{AbstractPPL.VarName{:k, Setfield.IdentityLens}}, Vector{Float64}, Vector{Set{DynamicPPL.Selector}}}, DynamicPPL.Metadata{Dict{AbstractPPL.VarName{:τ, Setfield.IdentityLens}, Int64}, Vector{Truncated{Beta{Float64}, Continuous, Float64, Float64, Float64}}, Vector{AbstractPPL.VarName{:τ, Setfield.IdentityLens}}, Vector{Float64}, Vector{Set{DynamicPPL.Selector}}}}}, Float64}, DynamicPPL.Model{typeof(model_lba), (:data, :min_rt), (:min_rt,), (), Tuple{NamedTuple{(:choice, :rt), Tuple{Vector{Int64}, Vector{Float64}}}, Float64}, Tuple{Float64}, DynamicPPL.DefaultContext}, Turing.ModeEstimation.OptimizationContext{DynamicPPL.LikelihoodContext{Nothing}}}})
    @ StatsAPI C:\Users\domma\.julia\packages\StatsAPI\HGVqH\src\statisticalmodel.jl:149
 [13] top-level scope
    @ c:\Users\domma\Box\Software\easyRT\julia_examples.jl:76

Also tagging @itsdfish as the LBA distribution comes from SequentialSamplingModels.jl.

Any thoughts?

Jul 19 '23 17:07 DominiqueMakowski

I'm not sure exactly what is wrong, but I suspect that it might be in this function, where the information matrix is computed.

With Random.seed!(54), the result is:

mle
ModeResult with maximized lp of 67.20
[3.276040489991997, 2.337659249931617, 0.8017492796484292, 0.06519091281327583, 0.33938286544022755]

stderror(mle)
A    │ 
─────┼───────────
ν[1] │   0.787391
ν[2] │    0.66385
A    │   0.224157
k    │  0.0452707
τ    │ 0.00990502

Both seem reasonable to me. However, with Random.seed!(65), there is an error. Further investigation reveals two problems:

using LinearAlgebra
lba_vcov = vcov(mle)
lba_vcov[I(5)]

A    │ 
─────┼───────────
ν[1] │  3.91107e7
ν[2] │    1.77096
A    │ -0.0257405
k    │   0.349782
τ    │ 0.00691001

The variance of ν[1] is very large and the variance of A is negative. If I increase the sample size to 200 or 500, the error is less frequent. It is possible that the standard errors are unstable (perhaps the derivatives are very small) when the sample size is small. To test this I tried using MAP in hopes that an informed prior would help stabilize the model, but the error still occurs.

Another thing I want to note is that the likelihood function is more complex than a typical GLM. I don't know if the ForwardDiff is using the correct rules. Another potential issue is I don't know how it handles parameter bounds and transformations. This type of model does not fit within the Distributions.jl type system. I added a new type:

SSM2D = Distribution{Multivariate, Mixed}

to reflect the fact that the model produces a mixed multivariate distribution consisting of one discrete variable and one continuous positive variable. I'm not sure if any of this is relevant. I would be happy to help where I can.

Jul 19 '23 20:07 itsdfish

This is very weird. A few things to check, that you may have already looked at:

Is the optimization result actually done? I could imagine the optimization terminating early or having relatively low default tolerances, which would explain why the result is different at different seeds and why some parts of the diagonal are negative.
It could be that the logpdf of LBA is pretty unstable -- knowing @itsdfish, I would not really expect this. However, the fact that the logpdf for LBA is defined as log(pdf(...)) is a bit worrying from a numeric perspective.

That's just two guesses off the hip. I haven't run the code yet, but it does seem that it's been replicated.

Jul 26 '23 19:07 cpfiffer

Is the optimization result actually done? I could imagine the optimization terminating early or having relatively low default tolerances, which would explain why the result is different at different seeds and why some parts of the diagonal are negative.

As far as I can tell, the optimization algorithm is converging and the default tolerances are not too large. Nonetheless, it might be worth investigating alternative algorithms and settings because at low sample sizes the MLEs jump around drastically, suggesting the algorithm is finding different local maxima. In addition, the estimate for the k parameter is close to zero when the standard error fails. This is a known problem with nonlinear models with more than 2 or 3 parameters. One solution is to run the optimization algorithm multiple times and select the best result. Here is a stopgap solution:

function robust_optimize(model, args...; opt_method=MLE(), n_reps, kwargs...)
    max_mle = optimize(model, opt_method, args...; kwargs...)
    max_lp = max_mle.lp 
    for _ ∈ 2:n_reps 
        mle = optimize(model, opt_method, args...; kwargs...)
        if mle.lp > max_lp 
            max_lp = mle.lp 
            max_mle = mle 
        end
    end
    return max_mle 
end

~~I ran the following about 30 times with a sample size of 10 until I came across an error~~:

mle = robust_optimize(m; opt_method=MLE(), n_reps=10)
stderror(mle)

~~So its not perfect, but I think it mostly solves the problem~~. I'm not sure whether this problem is common outside of this context. If it is, maybe an option for taking the best of multiple runs would be useful to the community.

2. It could be that the logpdf of LBA is pretty unstable -- knowing @itsdfish, I would not really expect this. However, the fact that the logpdf for LBA is defined as log(pdf(...)) is a bit worrying from a numeric perspective.

I appreciate the vote of confidence, but I wouldn't be surprised if I did something incorrectly =). You bring up a good a good point about the calculation of the logpdf. The PDF for the LBA is somewhat complex, and I doubt one can get very far with taking the log and simplifying/expanding. I checked a few popular packages in R for the LBA and it seems that they use the same approach. Given that lots of smart people have coded and used the LBA, my guess is that its probably not possible to make it much better.

Jul 26 '23 21:07 itsdfish

My apologies. I think the problem still persists. After restarting my session, it stopped working reliably. I think I may have been working with an old variable. Nonetheless, I think the points I made mostly hold. First, the typology of the log likelihood surface is highly convoluted and irregular for these types of models, particularly at small sample sizes where the derivatives are shallow. This will cause the the algorithm to find a local maxima, or some parameters to collapse to zero. Second, part of the solution will likely involve running the optimizer multiple times. Third, you might have to play around with some different optimizers. @DominiqueMakowski, you might try NelderMead or some kind of particle-based algorithm.

Jul 26 '23 21:07 itsdfish

Should this issue be transferred to Optim?

Aug 03 '23 09:08 DominiqueMakowski

The source of the problem is not clear in my opinion. Here is what we observe: when the sample size is small (e.g., 10), the MLEs are highly variable from run to run using the same data. When the MLE for k or A is close to zero, the SEs are either large or negative. One possibility is that the topology of the log likelihood surface is irregular or very flat for the LBA, which makes it unstable. The other possibility is there is a bug in the code, either in Optim, ForwardDiff, Turing, or somewhere else. There seem to be a lot of possibilities from my perspective. One thing that might distinguish between these possibilities is to save same data to a CSV and see if a similar problem exists in R using the same optimization algorithm and settings. If it does, then problem is likely that the LBA is too difficult for the algorithm. Would you be willing to look into that and share the R code and RNG seed used to generate the test data?

Aug 03 '23 10:08 itsdfish

I'm not even sure it is possible to reproduce that in R though... the closest thing I'd see is doing something like that, but I don't think it's implemented for LBA

Aug 03 '23 10:08 DominiqueMakowski

Optimizers accept a function with required inputs. So you would have to modify the example. However, I think you need to use optim with "L-BFGS-B" as the algorithm to make it comparable to the Turing default (but also check the settings and tolerances). Basically, you need to create a function that accepts a vector of parameters, an input that specifies the lower and upper bounds of the parameters, and other inputs to pass the data. The function should output the negative of the sum of log densities. If you have trouble, I can help the following weekend. Just tag me on here.

Aug 03 '23 19:08 itsdfish

Here's a snippet that shows the issue without using Turing.jl:

import Random
import SequentialSamplingModels
import Distributions
import Optim
import LinearAlgebra
import StatsFuns
import ForwardDiff

Random.seed!(23)
N = 100
true_params = [3.0, 2.0, 0.8, 0.2, 0.3]

make_dist(params) = SequentialSamplingModels.LBA(; ν=params[1:2], A=params[3], k=params[4], τ=params[5])

data = Random.rand(make_dist(true_params), N)
min_rt = minimum(data.rt)

# Transform arbitrary real parameters to the space of allowed parameters of the model.
transform_Ak(x) = exp(x)
transform_tau(x) = StatsFuns.logistic(x) * min_rt

function transform_vec(params)
    return [
        params[1],
        params[2],
        transform_Ak(params[3]),
        transform_Ak(params[4]),
        transform_tau(params[5]),
    ]
end

# The objective function to be minimized. Computes the negative log likelihood.
obj_fun(params) = -sum(Distributions.logpdf(make_dist(params), data))
obj_fun_transformed(params) = obj_fun(transform_vec(params))

# Run the optimization in the unconstrained parameter space.
result = Optim.optimize(obj_fun_transformed, true_params, Optim.LBFGS())
# Find the minimizer in the constrained parameter space. 
optimum = transform_vec(result.minimizer)
@show obj_fun(optimum)
@show result.minimum
@show optimum

# Study the gradient and the Hessian at the optimum.
grad_cons = ForwardDiff.gradient(obj_fun_transformed, result.minimizer)
grad = ForwardDiff.gradient(obj_fun, optimum)
hess = ForwardDiff.hessian(obj_fun, optimum)
println("Gradient in the unconstrained space:")
display(grad_cons)
println("Gradient in constrained space:")
display(grad)
println("Hessian in constrained space:")
display(hess)
println("Eigenvalues of the Hessian in constrained space:")
display(LinearAlgebra.eigvals(hess))

Output:

obj_fun(optimum) = -43.54033967928539
result.minimum = -43.54033967928539
optimum = [3.1320823256376564, 1.7421775229640926, 0.9907423073030727, 2.0174732805488142e-13, 0.3560112015877908]
Gradient in the unconstrained space:
5-element Vector{Float64}:
  1.3911627627649636e-8
 -9.754065000144863e-9
  1.8947312052652876e-8
  2.0620893936762994e-8
 -6.56081787088136e-10
Gradient in constrained space:
5-element Vector{Float64}:
      1.3911627627649636e-8
     -9.754065000144863e-9
      1.912436342621504e-8
 102211.48470998602
    -15.106596321349619
Hessian in constrained space:
5×5 Matrix{Float64}:
    20.3718       -0.0258564  -51.6217   -3.52996e5   -1861.18
    -0.0258564    14.3871     -24.002   -31.6841       -132.185
   -51.6217      -24.002      180.549   377.88          899.214
    -3.52996e5   -31.6841     377.88      5.15035e17      4.75029e15
 -1861.18       -132.185      899.214     4.75029e15      3.30172e13
Eigenvalues of the Hessian in constrained space:
5-element Vector{Float64}:
  -1.0794953204180992e13
   0.885329580019804
  15.797612316129877
 198.6236114938754
   5.1507896607501664e17

So Optim indeed finds a point in the space of unconstrained variables where the gradient is zero. But the gradient as computed by ForwardDiff in the constrained space of variables is very much not zero, and the Hessian has some frighteningly large elements and a very negative eigenvalue.

ReverseDiff gives similar results. Maybe the optimisation landscape is truly awful around the local minimum where one of the parameters has collapsed to zero, or there's some other numerical instability?

Regardless, this seem unrelated to Turing.jl, so I'll close the issue. If you think I'm mistaken let me know and we can reopen.

PS. If it is of any help, the latest version of Turing can also do constrained optimisation with user-provide constraints, so you could try enforcing that the problematic parameters that tend to collapse to zero are not only >0, but e.g. >0.1, like this:

mle = Turing.maximum_likelihood(
    m,
    OptimizationOptimJL.IPNewton();
    initial_params=[3.0, 2.0, 0.8, 0.2, 0.3],
    lb=[0.0, 0.0, 0.1, 0.1, 0.1],
    ub=fill(5.0, 5),
)

A quick experiment I did suggests though that the optimisation continues to get stuck in local minima, so this may not be of much help here.

Jun 24 '24 15:06 mhauru

@mhauru, thank you for the detailed analysis and explanation. I agree with your assessment that it is unrelated to Turing.

Jun 24 '24 20:06 itsdfish

Turing.jl Turing.jl copied to clipboard

stderror() on optimize() result fails

Turing.jl
Turing.jl copied to clipboard