MixedModels.jl
MixedModels.jl copied to clipboard
Termination at saddle point
I'm facing an issue where a linear mixed model terminates at a saddle point for some datasets. I've tried to search for any prior discussion but was unable to find anything. The short version is that BOBYQA terminates at a saddle point where Optim's BFGS does not. I did a brief search for discussions of BOBYQA and saddle points but didn't find anything obvious. I'm wondering if it might be worthwhile to switch to a derivative based optimization algorithm.
Now to the details. I'm working on a bioequivalence testing problem and I don't have the luxury of choosing the model. The model is stated in SAS form in appendix E of https://www.fda.gov/media/70958/download and for most datasets, I'm matching the SAS results exactly with the MixedModels formulation below. However, for some datasets (looks like it's associated with missings), the MixedModels fits end up at saddle points. This dataset is used for the reproducer
In the code below, I make tiny perturbations of the observations. Even though they are tiny, it's sufficient to make the difference between reaching an extremum and a saddle point (as I'll show further down)
using MixedModels, CSV, DataFrames, StatsBase, Random
using SparseArrays: nzrange
saddlepointdata = CSV.read("saddlepointdata.csv", DataFrame)
transform!(saddlepointdata, "AUCT" => eachindex => "row")
N = 500
_rng = Random.seed!(Random.default_rng(), 124)
fts = [
fit(
LinearMixedModel,
@formula(log(AUCT) ~
trt +
seq +
per +
(trt + 0 | sub) +
zerocorr(trt + 0 | row)
),
transform(
saddlepointdata,
"AUCT" => ByRow(t -> t + randn(_rng)*1e-12) => "AUCT"
);
contrasts = Dict(
[
:trt,
:per,
:sub,
:row,
] .=> Ref(DummyCoding())
),
REML = true
) for _ in 1:N
];
countmap(round.(exp.(getindex.(coef.(fts), 2)), digits=3))
I'm focusing on the second coefficient because it's the one of interest in the equivalence testing and it makes the output less verbose when restricting focus to a single coefficient. The result of the code above is
Dict{Float64, Int64} with 2 entries:
0.9329 => 483
0.9301 => 17
The first of these solutions is an extremum and the second is a saddle point. To see this, we can run
using CairoMakie, SixelTerm
saddlepointfit = fts[findfirst(t -> exp(coef(t)[2]) < 0.931, fts)];
plotx = range(-12, stop=12, length=101)
_i = 5
_θ = copy(saddlepointfit.θ)
ploty = [
MixedModels.deviance(
MixedModels.updateL!(
MixedModels.setθ!(
deepcopy(saddlepointfit), # to avoid mutating the original version
_θ + [zeros(_i - 1); _x; zeros(5 - _i)]
)
)
)
for _x in plotx
];
lineplot(plotx, ploty, xlim = extrema(plotx), width = 80)
which gives
so we are clearly not at a minimum.
To dig a little deeper, I'm using a version of the objective function that allows for ForwardDiff. Find the code in the folded section below
Forward compatible MixedModels code
using Optim, LinearAlgebra
mybound = x -> [exp.(x[1:3]); x[4]; exp.(x[5:6])]
function my_fit(
ft::MixedModels.LinearMixedModel;
show_trace = false,
extended_trace = false
)
opt_res = optimize(
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
BFGS(initial_stepnorm = 1.0),
Optim.Options(;show_trace, extended_trace),
autodiff=:forward
) do θ
my_deviance(ft, mybound(θ))
end
new_ft = deepcopy(ft)
MixedModels.setθ!(new_ft, mybound(opt_res.minimizer)[1:end - 1])
MixedModels.updateL!(new_ft)
return new_ft
end
function my_deviance(model::LinearMixedModel, θ::AbstractVector{T}) where {T}
σ² = θ[end]^2
_θ = θ[1:(end-1)]
dof = MixedModels.ssqdenom(model)
# Extract and promote
A, L, reterms = model.A, model.L, model.reterms
AA = [LinearAlgebra.copy_oftype(Ai, T) for Ai in A]
LL = [LinearAlgebra.copy_oftype(Li, T) for Li in L]
RR = [LinearAlgebra.copy_oftype(Ri, T) for Ri in reterms]
# Update state with new θ
_setθ!(RR, model.parmap, _θ)
myupdateL!(AA, LL, RR)
r² = _pwrss(LL)
ld = _logdet(LL, RR, model.optsum.REML)
return dof * log(2 * π * σ²) + ld + r² / σ²
end
function _setθ!(
reterms::Vector{<:MixedModels.AbstractReMat},
parmap::Vector{<:NTuple},
θ::AbstractVector,
)
length(θ) == length(parmap) || throw(DimensionMismatch())
reind = 1
λ = first(reterms).λ
for (tv, tr) in zip(θ, parmap)
tr1 = first(tr)
if reind ≠ tr1
reind = tr1
λ = reterms[tr1].λ
end
λ[tr[2], tr[3]] = tv
end
return reterms
end
function myupdateL!(A::Vector, L::Vector, reterms::Vector)
k = length(reterms)
copyto!(last(L), last(A)) # ensure the fixed-effects:response block is copied
for j in eachindex(reterms) # pre- and post-multiply by Λ, add I to diagonal
cj = reterms[j]
diagind = MixedModels.kp1choose2(j)
MixedModels.copyscaleinflate!(L[diagind], A[diagind], cj)
for i = (j+1):(k+1) # postmultiply column by Λ
bij = MixedModels.block(i, j)
MixedModels.rmulΛ!(copyto!(L[bij], A[bij]), cj)
end
for jj = 1:(j-1) # premultiply row by Λ'
MixedModels.lmulΛ!(cj', L[MixedModels.block(j, jj)])
end
end
for j = 1:(k+1) # blocked Cholesky
Ljj = L[MixedModels.kp1choose2(j)]
for jj = 1:(j-1)
_rankUpdate!(
Hermitian(Ljj, :L),
L[MixedModels.block(j, jj)],
-one(eltype(Ljj)),
one(eltype(Ljj)),
)
end
_cholUnblocked!(Ljj, Val{:L})
LjjT = isa(Ljj, Diagonal) ? Ljj : LowerTriangular(Ljj)
for i = (j+1):(k+1)
Lij = L[MixedModels.block(i, j)]
for jj = 1:(j-1)
mul!(
Lij,
L[MixedModels.block(i, jj)],
L[MixedModels.block(j, jj)]',
-one(eltype(Lij)),
one(eltype(Lij)),
)
end
rdiv!(Lij, LjjT')
end
end
return nothing
end
_pwrss(L::Vector) = abs2(last(last(L)))
function _logdet(L::Vector, reterms::Vector{<:MixedModels.AbstractReMat}, REML::Bool)
@inbounds s = sum(j -> MixedModels.LD(L[MixedModels.kp1choose2(j)]), axes(reterms, 1))
if REML
lastL = last(L)
s += MixedModels.LD(lastL) # this includes the log of sqrtpwrss
s -= log(last(lastL)) # so we need to subtract it from the sum
end
return s + s # multiply by 2 b/c the desired det is of the symmetric mat, not the factor
end
LinearAlgebra.copy_oftype(A::MixedModels.UniformBlockDiagonal, ::Type{T}) where {T} =
MixedModels.UniformBlockDiagonal(LinearAlgebra.copy_oftype(A.data, T))
function LinearAlgebra.copy_oftype(
A::MixedModels.BlockedSparse{<:Any,S,P},
::Type{T},
) where {T,S,P}
cscmat = LinearAlgebra.copy_oftype(A.cscmat, T)
nzsasmat = LinearAlgebra.copy_oftype(A.nzsasmat, T)
MixedModels.BlockedSparse{T,S,P}(cscmat, nzsasmat, A.colblkptr)
end
function LinearAlgebra.copy_oftype(A::MixedModels.ReMat{<:Any,S}, ::Type{T}) where {T,S}
MixedModels.ReMat{T,S}(
A.trm,
A.refs,
A.levels,
A.cnames,
LinearAlgebra.copy_oftype(A.z, T),
LinearAlgebra.copy_oftype(A.wtz, T),
LinearAlgebra.copy_oftype(A.λ, T),
A.inds,
LinearAlgebra.copy_oftype(A.adjA, T),
LinearAlgebra.copy_oftype(A.scratch, T),
)
end
function _cholUnblocked!(A::StridedMatrix, ::Type{Val{:L}})
cholesky!(Hermitian(A, :L))
return A
end
function _cholUnblocked!(D::MixedModels.UniformBlockDiagonal, ::Type{T}) where {T}
Ddat = D.data
for k in axes(Ddat, 3)
_cholUnblocked!(view(Ddat, :, :, k), T)
end
return D
end
function _cholUnblocked!(A::Diagonal, ::Type{T}) where {T}
A.diag .= sqrt.(A.diag)
return A
end
_cholUnblocked!(A::AbstractMatrix, ::Type{T}) where {T} = MixedModels.cholUnblocked!(A, T)
function _rankUpdate!(
C::LinearAlgebra.HermOrSym{T,UniformBlockDiagonal{T}},
A::StridedMatrix{T},
α,
β,
) where {T}
Cdat = C.data.data
LinearAlgebra.require_one_based_indexing(Cdat, A)
isone(β) || rmul!(Cdat, β)
blksize = size(Cdat, 1)
for k in axes(Cdat, 3)
ioffset = (k - 1) * blksize
joffset = (k - 1) * blksize
for i in axes(Cdat, 1), j = 1:i
iind = ioffset + i
jind = joffset + j
AtAij = 0
for idx in axes(A, 2)
# because the second multiplicant is from A', swap index order
AtAij += A[iind, idx] * A[jind, idx]
end
Cdat[i, j, k] += α * AtAij
end
end
return C
end
function _rankUpdate!(
C::LinearAlgebra.HermOrSym{T,UniformBlockDiagonal{T}},
A::MixedModels.BlockedSparse{T,S},
α,
β,
) where {T,S}
Ac = A.cscmat
cp = Ac.colptr
all(==(S), diff(cp)) ||
throw(ArgumentError("Columns of A must have exactly $S nonzeros"))
Cdat = C.data.data
LinearAlgebra.require_one_based_indexing(Ac, Cdat)
j, k, l = size(Cdat)
S == j == k && div(Ac.m, S) == l ||
throw(DimensionMismatch("div(A.cscmat.m, S) ≠ size(C.data.data, 3)"))
nz = Ac.nzval
rv = Ac.rowval
@inbounds for j in axes(Ac, 2)
nzr = nzrange(Ac, j)
# BLAS.syr!('L', α, view(nz, nzr), view(Cdat, :, :, div(rv[last(nzr)], S)))
_x = view(nz, nzr)
view(Cdat, :, :, div(rv[last(nzr)], S)) .+= α .* _x .* _x'
end
return C
end
function _rankUpdate!(
C::LinearAlgebra.HermOrSym{T,S},
A::StridedMatrix{T},
α,
β,
) where {T,S}
# BLAS.syrk!(C.uplo, 'N', T(α), A, T(β), C.data)
C.data .*= β
C.data .+= α .* A * A'
return C
end
_rankUpdate!(C::AbstractMatrix, A::AbstractMatrix, α, β) =
MixedModels.rankUpdate!(C, A, α, β)
Notice that this version of the objective function hasn't concentrated out σ. With this code, I can compute the Hessian which clearly shows the indefiniteness
julia> using ForwardDiff
julia> (objective(saddlepointfit), my_deviance(saddlepointfit, [_θ; saddlepointfit.sigma]))
(-284.3311306944413, -284.33113069939395)
julia>
julia> ForwardDiff.hessian(θσ -> my_deviance(saddlepointfit, θσ), [_θ; saddlepointfit.sigma])
6×6 Matrix{Float64}:
1.04012 9.41825e-6 0.00227411 0.000198533 1.81058e-8 1469.2
9.41825e-6 0.58946 -1.33622e-5 -0.0003711 0.000425082 1660.82
0.00227411 -1.33622e-5 0.860294 0.0357195 -2.86937e-7 1510.92
0.000198533 -0.0003711 0.0357195 0.288273 -2.35253e-6 -1.50027
1.81058e-8 0.000425082 -2.86937e-7 -2.35253e-6 -0.051874 1.04004
1469.2 1660.82 1510.92 -1.50027 1.04004 9.40305e6
Finally, we can use the ForwardDiff compatible code to fit the model with BFGS using ForwardDiff based derivatives
julia> my_fts = [my_fit(ft) for ft in fts];
julia> countmap(round.(exp.(getindex.(coef.(my_fts), 2)), digits=4))
Dict{Float64, Int64} with 1 entry:
0.9329 => 500
so with BFGS, we never end up at the saddle point (for this dataset). Of course, we don't know if that is the case for other datasets but I wanted to share this observation to start the conversation. Maybe a gradient based optimization should be considered. It will of course require some work since the version I threw together here is rather allocating so can't be used right away but might still be useful as a starting point.
@andreasnoack Can you show us versioninfo()?
@dmbates has actually done a fair number of more involved experiments with using the gradient and it never really paid off. That said, we've recently discussed revisiting this and evaluating the Hessian efficiently has several uses.
In my own experience, BOBYQA rarely fails on well-posed problems, but we do support using other derivative-free algorithms implemented in NLopt. For example, if we change to COBYLA, things work:
N = 500
_rng = Random.seed!(Random.default_rng(), 124)
formula = @formula(log(AUCT) ~ trt + seq + per +
(trt + 0 | sub) +
zerocorr(trt + 0 | row))
contrasts = Dict(:trt => DummyCoding(), :per => DummyCoding(),
:sub => Grouping(), :row => Grouping())
fts = @showprogress map(1:N) do _
data = transform(saddlepointdata,
"AUCT" => ByRow(t -> t + randn(_rng)*1e-12) => "AUCT")
model = LinearMixedModel(formula, data; contrasts)
model.optsum.optimizer = :LN_COBYLA
return fit!(model; REML=true)
end
countmap(round.(exp.(getindex.(coef.(fts), 2)), digits=3))
yields
Dict{Float64, Int64} with 1 entry:
0.933 => 500
We have five test dataset where we could reproduce the saddle point. When running the test script above with those datasets, COBYLA always avoids the saddle point solutions that BOBYQA sometimes terminate at.
It would be interesting to know if this an issue with the implementation of BOBYQA or an issue with the algorithm. cc @stevengj.
@dmbates has actually done a fair number of more involved experiments with using the gradient and it never really paid off.
I'd be curious to read more about this. My expectation would be that it would be beneficial to use the gradient.
That said, we've recently discussed revisiting this and evaluating the Hessian efficiently has several uses.
The reason I had the ForwardDiff compatible deviance code handy is actually that I'm using it for the Satterthwaite approximation for the degrees of freedom. I'd be happy to contribute that here if there is an interest.
The reason I had the ForwardDiff compatible deviance code handy is actually that I'm using it for the Satterthwaite approximation for the degrees of freedom. I'd be happy to contribute that here if there is an interest.
@andreasnoack definitely! That's been on my secret todo list for a very long time, but other things were always higher priority.
@andreasnoack it's stale relative to main, but I think this was the most recent effort: https://github.com/JuliaStats/MixedModels.jl/tree/forwarddiff
@andreasnoack see also https://github.com/JuliaStats/MixedModels.jl/issues/312
As I mention in #742, I have been looking at switching MixedModels.jl to the PRIMA.jl version of bobyqa for the constrained, derivative-free optimization. @zaikunzhang, the developer of libprima, spent considerable amounts of time translating Powell's original f77 code to a modularized modern Fortran. Along the way he discovered and fixed a number of bugs.
I will let you know if I am able to apply PRIMA.bobyqa to this problem
Because of #833, I revisited this issue. I just tried out the PRIMA backend and the issue persists. I.e.
N = 500
_rng = Random.seed!(Random.default_rng(), 124)
fts = map(1:N) do _
m = LinearMixedModel(
@formula(log(AUCT) ~
trt +
seq +
per +
(trt + 0 | sub) +
zerocorr(trt + 0 | row)
),
transform(
saddlepointdata,
"AUCT" => ByRow(t -> t + randn(_rng)*1e-12) => "AUCT"
);
contrasts = Dict(
[
:trt,
:per,
:sub,
:row,
] .=> Ref(DummyCoding())
)
)
m.optsum.backend = :prima
# m.optsum.optimizer = :cobyla
m.optsum.optimizer = :bobyqa
fit!(
m;
REML = true
)
end
countmap(round.(exp.(getindex.(coef.(fts), 2)), digits=3))
gives
julia> countmap(round.(exp.(getindex.(coef.(fts), 2)), digits=3))
Dict{Float64, Int64} with 2 entries:
0.933 => 191
0.93 => 309
and the issue goes away with :cobyla. This might be of interest to @zaikunzhang
I don't understand the model. If you have random effects for trt by row then these random effects are confounded with the per-observation noise term. It is not surprising to me that estimation is unstable. One result that I got was
julia> m1 = ans
Linear mixed model fit by maximum likelihood
:(log(AUCT)) ~ 1 + trt + seq + per + (trt + 0 | sub) + zerocorr(trt + 0 | row)
logLik -2 logLik AIC AICc BIC
163.7786 -327.5571 -303.5571 -302.0571 -262.7792
Variance components:
Column Variance Std.Dev. Corr.
row trt: R 0.00004520 0.00672329
trt: T 0.05359023 0.23149563 .
sub trt: R 0.02513834 0.15855076
trt: T 0.01178772 0.10857125 -0.18
Residual 0.00004303 0.00655984
Number of obs: 221; levels of grouping factors: 221, 64
Fixed-effects parameters:
──────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|)
──────────────────────────────────────────────────────
(Intercept) 4.62829 0.0285346 162.20 <1e-99
trt: T -0.0697061 0.0344717 -2.02 0.0432
seq: TRTR 0.015153 0.0315155 0.48 0.6306
per: 2 -0.0543228 0.0345276 -1.57 0.1156
per: 3 0.00428555 0.00265393 1.61 0.1064
per: 4 -0.0570323 0.0345231 -1.65 0.0985
──────────────────────────────────────────────────────
julia> m1.θ
5-element Vector{Float64}:
1.0249175097804113
35.289831304567535
24.169914254418064
-2.9855064857546267
16.279405622424537
Notice that the elements of θ, which are rarely greater than 1.0, are as large as 35. This is because all of the per-observation noise is being incorporated into the random effects by row. I'm not sure that it is meaningful to use zerocorr here because one of the random effects for each level of row is always zero.
julia> first(ranef(m1))
2×221 Matrix{Float64}:
0.0 0.00424502 0.0 -0.00443598 0.0015443 … 0.0 -0.0039046 0.0 0.00303718 -0.00310098
-0.393695 0.0 0.14303 0.0 0.0 0.279237 0.0 -0.0453793 0.0 0.0
I'm sure that SAS PROC MIXED users can convince themselves that this makes sense but I can't.
The model might not make sense, but you are in a better position to tell the FDA than I am. However, even if the model doesn't make sense, it seems wrong to me that BOBYQA terminates at a saddle point. I've compiled a debug build of nlopt so I might try so see if I can make any sense of the extra output that can be enabled.
As you said, the FDA used a SAS PROC MIXED formulation of the model so there is potentially a problem in translating the PROC MIXED formulation to the MixedModels.jl formulation. Even after 30 years of thinking about mixed-effects models there are some PROC MIXED specifications that I just don't understand.
If the model is over-specified, which I believe is the case, then most positions in the parameter space could represent saddle points.
The fact remains that this model confounds part of the zerocorr(0 + trt|row) random effects term with the residual noise term. As far as I can see the model is inestimable.
I don't think that the by-row random effect is in the SAS model. I didn't see it in my reading of it, but I've pinged @ararslan who's actually worked with SAS before.
There are two separate issues here. Both are interesting but I think it might be worthwhile to consider them separately
- Does the model make sense and does it do the same thing as the SAS version
- Is it a bug in BOBYQA that it terminates with
SUCCESSat a saddle point. Here it doesn't matter what created the objective function.
Regarding 1., then clearly it is overparameterized. I'm shoehorning MixedModels into fitting a model with an R matrix, see the SAS docs, that isn't just $σ²I$ since the SAS formulation has two different sigmas, one for each formulation. The R part is what the REPEATED statement specifies. I can mimic the R behavior in MixedModels by introducing the "row factor" but then the residual variance parameter is redundant. My thinking was that it would be a minor issue with that redundancy as long as you didn't interpret the estimates of those parameters then it should give you the same result as SAS.
Regarding 2., then, as mentioned above, I think we can ignore how the objective function was created. Does it really make sense to return SUCCESS when terminating at a saddle point?
-
I think you need to set the residual standard deviation to make that trick work -- the
σkwarg tofitwill handle that. -
I think it would be better if (this implementation of) BOBYQA didn't terminate at a saddle point, but I don't know if it's fully avoidable. BOBYQA is doing a quadratic approximation, so it may not "see" the saddle point. The PRIMA version's use of a confidence radius may be better than the formulation used in NLopt, but I haven't thought too deeply about this (I'm writing this quickly while waiting for CI at work to run). That said, I think the quadratic approximation bit is important here -- in an idealized mixed model model, the objective function is quadratic so BOBYQA will do well very well. Of course, neither models nor the data that they are fit to are ideal in real life, so we do need to be able to detect when things break down.
I agree with @palday 's point 1. I'll try fitting the model with the residual standard deviation constrained to 1.0
BTW, if anyone else has tried that please let me know. I have been having difficulty with my vision due to cataracts. Right now they have operated on the right eye but it will be another week until I have the left eye done so I am in a weird half-blurry, half-sharp state w.r.t. my vision. It makes reading kind of challenging.
The first here is using fit!(model; REML=true, σ=1.0) and the second does not specify σ:
julia> countmap(round.(exp.(getindex.(coef.(fts), 2)); digits=3))
Dict{Float64, Int64} with 1 entry:
0.928 => 500
julia> countmap(round.(exp.(getindex.(coef.(fts0), 2)); digits=3))
Dict{Float64, Int64} with 2 entries:
0.933 => 477
0.93 => 23
Thanks for looking at that, @ararslan .
I appreciate the insight but I'm still confused. I tried varying the value of the σ argument and it does tend to give the lowest value of the objective when it is fixed at around 0.009, which is where the optimum occurs if the σ argument is not specified. I'm really not sure what the criterion means at this point. I can believe that things go wonky with the definition of the log-likelihood or the REML criterion and there should be some correction term but right now I don't know what that would be.
I read carefully all the discussions. However --- forgive my ignorance about Julia and the specific application --- I did not even figure out the following basic questions:
-
What was the objective function?
-
What were the decision variables?
-
What were the constraints?
-
What were the solutions produced by different solvers?
-
What does the nice curve mean in the original question posted two years ago?
-
Why was it obvious that the solution produced by BOBYQA was a saddle point rather than a (local) minimum?
From a theoretical point of view, generically, it is almost impossible for a optimization solver --- sloppy or good --- to converge to a "saddle point", as such points are unstable.
Thanks.
@ararslan I think the value would have to be sufficiently small to not impose a restriction on the model since the components cannot be negative. The estimated sigmas in the output that @dmbates shared are 0.00004520 and 0.05359023 for the R and T specific variances and 0.00004303 for the residual variance so I think the pinned σ should be no greater than
julia> sqrt(0.00004520 + 0.00004303)
0.009393082561119113
to avoid restricting the model. With σ=0.009, I'm getting
julia> countmap(round.(exp.(getindex.(coef.(fts), 2)); digits=3))
Dict{Float64, Int64} with 2 entries:
0.933 => 492
0.93 => 8
BOBYQA is doing a quadratic approximation, so it may not "see" the saddle point.
@palday I only have a high level understanding BOBYQA but I have also had this thought. However, if that is the case then it is a severe limitation in the method that should be mentioned in the description of it and suggests that a quasi Newton method might be a better option. We only detected this issue because of an extensive comparison against SAS results so there might be several MixedModels fits that hits this issue.
@zaikunzhang thanks for joining the conversation.
- What was the objective function?
It is the restricted maximum likelihood objective function of a mixed effects model after concentrating out the fixed effects parameters. There are some details in https://juliastats.org/MixedModels.jl/stable/optimization/ but the code in the first is a complete example, although in Julia.
- What were the decision variables?
The elements of the Cholesky factors of the random effects, i.e. elements of $\Lambda_\theta$ in the link above.
The diagonal elements of the Cholesky factors are restricted to be positive. The off-diagonal elements are unrestricted.
- What were the solutions produced by different solvers?
BFGS and COBYLA consistently give 0.9329 (although I recently hit https://github.com/JuliaStats/MixedModels.jl/issues/833 with the latter)
- What does the nice curve mean in the original question posted two years ago?
It is the objective function around the saddle point solution as a function of the fifth decision variable.
- Why was it obvious that the solution produced by BOBYQA was a saddle point rather than a (local) minimum?
The graph in the figure is clearly concave at zero (the "solution" returned by nlopt and PRIMA) in the fifth coordinate. I also included the Hessian in the first post which clearly isn't positive definite.
@zaikunzhang The derivation and interpretation of the objective is described more fully in https://arxiv.org/abs/2505.11674
From the point of view of the optimization problem, the objective is a function of, in this case, a parameter vector of length 5. A summary of the optimization stage for one of these examples is
julia> m.optsum
Initial parameter vector: [1.0, 1.0, 1.0, 0.0, 1.0]
Initial objective value: 38318.58532665327
Backend: prima
Optimizer: cobyla
Lower bounds: [0.0, 0.0, 0.0, -Inf, 0.0]
rhobeg: 1.0
rhoend: 1.0e-6
maxfeval: -1
Function evaluations: 1306
xtol_zero_abs: 0.001
ftol_zero_abs: 1.0e-5
Final parameter vector: [0.3681424453109934, 25.722562383726956, 17.90268794574329, -2.207715770191636, 12.560645490417164]
Final objective value: -286.16940754656434
Return code: SMALL_TR_RADIUS
The constraints on this parameter vector are lower bounds on four of the five scalar values, which are constrained to be non-negative.
The representation of the statistical model converts these parameters into small, lower triangular matrices that are the lower Cholesky factors of covariance matrices (which must be positive semi-definite).
julia> m.θ
5-element Vector{Float64}:
0.3681424453109934
25.722562383726956
17.90268794574329
-2.207715770191636
12.560645490417164
julia> println.(getproperty.(m.reterms, :λ));
[0.3681424453109934 0.0; 0.0 25.722562383726956]
[17.90268794574329 0.0; -2.207715770191636 12.560645490417164]
For this model the first of these Cholesky factors is diagonal and the second is lower triangular.
The profiled objective, which is negative twice the log-likelihood (or the restricted log-likelihood), is a function of these parameters only. There are other parameters in the statistical model but their optimal values, given these parameters, can be evaluated without iteration. (Exactly how this is done is the subject of the arxiv paper mentioned above.)
The nature of the model means that the starting values, [1.0, 1.0, 1.0, 0.0, 1.0], are chosen so that the Cholesky factors generated from these are identity matrices.
If you wish we can provide the parameter values and the corresponding objective values during the course of the optimization using bobyqa and using cobyla. Evaluation of the objective itself is somewhat involved and our Julia code may not be easily translated into other programming languages. We use Julia's multiple dispatch capabilities extensively and also accelerated BLAS implementations. So even on the same processor family and operating system there may be very slight differences in the value of the objective when, say, different BLAS implementations change the order of operations in the Lapack function dpotrf.
The behind-the-scenes problem we are discussing is in the application of the statistical model in the analysis of pharmacokinetic data for submission to the Food and Drug Administration (FDA) in the United States. Their recommendations on how to perform the analysis uses proprietary software from SAS Institute and their algorithms are not disclosed. It is a difficult situation of trying to reproduce results from other software without a detailed description of how that software works.
Please let us know if there is any further information we can provide for you.
Thank you @andreasnoack for the explanation. I understand things better now but still not completely.
Let me try. So the solution returned by COBYLA and BFGS is 0.9329, which is considered correct, but BOBYQA returns 0.9301, which is believed to be saddle point. Is this right?
If that is the case, could you check
-
What are the function values corresponding to 0.9328 and 0.9301?
-
What is your stopping criteria? Could you try setting the maximal number of function evaluations to 1000*n and the smallest trust region radius to machine epsilon?
Let me repeat: generically, it is impossible for any solver to converge to a saddle point.
Thank you @andreasnoack for the explanation. I understand things better now but still not completely.
Let me try. So the solution returned by COBYLA and BFGS is 0.9329, which is considered correct, but BOBYQA returns 0.9301, which is believed to be saddle point. Is this right?
If that is the case, could you check
-
What are the function values corresponding to 0.9328 and 0.9301?
-
What is your stopping criteria? Could you try setting the maximal number of function evaluations to 1000*n and the smallest trust region radius to machine epsilon?
Let me repeat: generically (it has a particular mathematical meaning), it is impossible for any solver to converge to a saddle point.
@zaikunzhang I should add that to generate the output
Dict{Float64, Int64} with 2 entries:
0.9329 => 483
0.9301 => 17
that shows the two different solutions, I'm adding a tiny random perturbation to the data at each of 500 times I'm running the optimization. Based on those 500 fits, I'm getting
julia> combine(groupby(tmp_df, "xval"), "fval" .=> [mean, var])
2×3 DataFrame
Row │ xval fval_mean fval_var
│ Float64 Float64 Float64
─────┼─────────────────────────────────
1 │ 0.93 -284.331 2.70211e-11
2 │ 0.933 -286.169 1.05848e-11
which gives you the function values associated with the two termination values.
What is your stopping criteria?
Please see https://github.com/JuliaStats/MixedModels.jl/issues/705#issuecomment-3028183418, which, I believe, should have this information.
Could you try setting the maximal number of function evaluations to 1000*n and the smallest trust region radius to machine epsilon?
I interpret this as setting rhoend. That essentially gives the same result and I'm not reaching that maximum number of iterations
julia> maximum([ft.optsum.feval for ft in fts])
694
Update: Let me share the optimization output for one of each results
Bad
julia> fts[1].optsum
Initial parameter vector: [1.0, 1.0, 1.0, 0.0, 1.0]
Initial objective value: -44.70893444134606
Backend: prima
Optimizer: bobyqa
Lower bounds: [0.0, 0.0, 0.0, -Inf, 0.0]
rhobeg: 1.0
rhoend: 2.220446049250313e-16
maxfeval: -1
Function evaluations: 325
xtol_zero_abs: 0.001
ftol_zero_abs: 1.0e-5
Final parameter vector: [0.2618306642347577, 27.735563195267034, 17.367158317488414, -2.1765460166731314, 0.0]
Final objective value: -284.33112518002326
Return code: SMALL_TR_RADIUS
Good
julia> fts[6].optsum
Initial parameter vector: [1.0, 1.0, 1.0, 0.0, 1.0]
Initial objective value: -44.708934441224415
Backend: prima
Optimizer: bobyqa
Lower bounds: [0.0, 0.0, 0.0, -Inf, 0.0]
rhobeg: 1.0
rhoend: 2.220446049250313e-16
maxfeval: -1
Function evaluations: 583
xtol_zero_abs: 0.001
ftol_zero_abs: 1.0e-5
Final parameter vector: [0.7466427813690414, 30.138734078056896, 20.96629512954934, -2.5830367079671253, 14.69080363737362]
Final objective value: -286.1694145110057
Return code: SMALL_TR_RADIUS
FWIW, here are the optimization traces for a good and a bad fit
xtol_zero_abs: 0.001 ftol_zero_abs: 1.0e-5
What are these parameters? Are they part of the stopping criteria? They are not native PRIMA parameters. They probably come from the upper-level interface. Could you set them to the values that will not interfere the termination (maybe 0)? Thanks.
BTW, is your objective function smooth? Noisy?
@palday I can see that you added them in https://github.com/JuliaStats/MixedModels.jl/commit/34899cf6d95abb2df8c961c9de5a6a500351ebff. What do they do? I tried to set them zero and it doesn't seem to make a difference.
The MixedModels.jl code can use either the NLopt versions (Julia package is NLopt.jl) or the libprima versions (Julia package is PRIMA.jl) of BOBYQA or COBYLA. The internal structure that holds the parameters for the optimizers represents the union of the parameters for the two different implementations. It is possible that a couple of the NLopt parameters leaked into the .optsum display for the PRIMA version. In other words, the structure contains the optimizer parameters for both implementations. The display should show only those from the PRIMA optimizer when PRIMA is in use but it may accidentally contain a couple of parameters from NLopt.
Looking at the traces it seems that the big difference is whether x5 gets away from zero, its lower bound.
The traces also show me something about the first plot (the one that looks like a "bell curve" with upturned edges) that I didn't understand. That curve should not have negative values on the x-axis because of the lower bound. It is not a matter of converging to a saddle point as much as it is a matter of converging on the boundary.
So if the iterations can get away from that boundary around 250 evaluations or so then they proceed to the optimum - otherwise they converge on the boundary
It is not a matter of converging to a saddle point as much as it is a matter of converging on the boundary.
Ah. I didn't notice ealier that the fifth element was at the boundary in the bad runs. Maybe the combination of being at the boundary and a zero gradient is a problem? @zaikunzhang thoughts on that?
I tried to change the lower bounds to -Inf for the parameters associated with the triangular factor. The positive sign is only a convention and not needed for the result to be valid. Without the lower bounds, I'm getting
julia> countmap(round.(exp.(getindex.(coef.(fts), 2)), digits=3))
Dict{Float64, Int64} with 1 entry:
0.933 => 500
i.e. not bad solutions. This raises two questions
- Is the behavior of BOBYQA expected when at the boundary?
- Should MixedModels stop imposing non-negativity constraints on Cholesky factor parameters?
- Is the behavior of BOBYQA expected when at the boundary?
Things become clearer now. It is due to the bound constraint.
This is not unexpected. Indeed, any method that does not exploit second-order (curvature) information may get trapped by such a "saddle point" when solving bound constraints problems. This is different from the unconstrained case, where convergence to saddle points is generically impossible for a proper algorithm.
This is not unique to BOBYQA. I believe COBYLA and (finite-difference) BFGS can encounter the same problem if your test is sufficiently extensive. Even for algorithms using exact gradients (rather than derivative-free methods), it is not difficult to construct non-pathological examples where the problem occurs often.