MixedModels.jl icon indicating copy to clipboard operation
MixedModels.jl copied to clipboard

Termination at saddle point

Open andreasnoack opened this issue 2 years ago • 8 comments

I'm facing an issue where a linear mixed model terminates at a saddle point for some datasets. I've tried to search for any prior discussion but was unable to find anything. The short version is that BOBYQA terminates at a saddle point where Optim's BFGS does not. I did a brief search for discussions of BOBYQA and saddle points but didn't find anything obvious. I'm wondering if it might be worthwhile to switch to a derivative based optimization algorithm.

Now to the details. I'm working on a bioequivalence testing problem and I don't have the luxury of choosing the model. The model is stated in SAS form in appendix E of https://www.fda.gov/media/70958/download and for most datasets, I'm matching the SAS results exactly with the MixedModels formulation below. However, for some datasets (looks like it's associated with missings), the MixedModels fits end up at saddle points. This dataset is used for the reproducer

saddlepointdata.csv

In the code below, I make tiny perturbations of the observations. Even though they are tiny, it's sufficient to make the difference between reaching an extremum and a saddle point (as I'll show further down)

using MixedModels, CSV, DataFrames, StatsBase, Random
using SparseArrays: nzrange

saddlepointdata = CSV.read("saddlepointdata.csv", DataFrame)
transform!(saddlepointdata, "AUCT" => eachindex => "row")

N = 500
_rng = Random.seed!(Random.default_rng(), 124)
fts = [
    fit(
        LinearMixedModel,
        @formula(log(AUCT) ~
            trt +
            seq +
            per +
            (trt + 0 | sub) +
            zerocorr(trt + 0 | row)
        ),
        transform(
            saddlepointdata,
            "AUCT" => ByRow(t -> t + randn(_rng)*1e-12) => "AUCT"
        );
        contrasts = Dict(
            [
                :trt,
                :per,
                :sub,
                :row,
            ] .=> Ref(DummyCoding())
        ),
        REML = true
    ) for _ in 1:N
];

countmap(round.(exp.(getindex.(coef.(fts), 2)), digits=3))

I'm focusing on the second coefficient because it's the one of interest in the equivalence testing and it makes the output less verbose when restricting focus to a single coefficient. The result of the code above is

Dict{Float64, Int64} with 2 entries:
  0.9329 => 483
  0.9301 => 17

The first of these solutions is an extremum and the second is a saddle point. To see this, we can run

using CairoMakie, SixelTerm

saddlepointfit = fts[findfirst(t -> exp(coef(t)[2]) < 0.931, fts)];

plotx = range(-12, stop=12, length=101)
_i = 5
_θ = copy(saddlepointfit.θ)
ploty = [
    MixedModels.deviance(
        MixedModels.updateL!(
            MixedModels.setθ!(
                deepcopy(saddlepointfit), # to avoid mutating the original version
                _θ + [zeros(_i - 1); _x; zeros(5 - _i)]
            )
        )
    )
    for _x in plotx
];
lineplot(plotx, ploty, xlim = extrema(plotx), width = 80)

which gives Skærmbillede 2023-08-10 kl  09 18 13 so we are clearly not at a minimum.

To dig a little deeper, I'm using a version of the objective function that allows for ForwardDiff. Find the code in the folded section below

Forward compatible MixedModels code
using Optim, LinearAlgebra
mybound = x -> [exp.(x[1:3]); x[4]; exp.(x[5:6])]

function my_fit(
  ft::MixedModels.LinearMixedModel;
  show_trace = false,
  extended_trace = false
  )
  opt_res = optimize(
      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
      BFGS(initial_stepnorm = 1.0),
      Optim.Options(;show_trace, extended_trace),
      autodiff=:forward
  ) do θ
      my_deviance(ft, mybound(θ))
  end
  new_ft = deepcopy(ft)
  MixedModels.setθ!(new_ft, mybound(opt_res.minimizer)[1:end - 1])
  MixedModels.updateL!(new_ft)
  return new_ft
end

function my_deviance(model::LinearMixedModel, θ::AbstractVector{T}) where {T}

  σ² = θ[end]^2
  _θ = θ[1:(end-1)]
  dof = MixedModels.ssqdenom(model)

  # Extract and promote
  A, L, reterms = model.A, model.L, model.reterms
  AA = [LinearAlgebra.copy_oftype(Ai, T) for Ai in A]
  LL = [LinearAlgebra.copy_oftype(Li, T) for Li in L]
  RR = [LinearAlgebra.copy_oftype(Ri, T) for Ri in reterms]

  # Update state with new θ
  _setθ!(RR, model.parmap, _θ)
  myupdateL!(AA, LL, RR)

  r² = _pwrss(LL)
  ld = _logdet(LL, RR, model.optsum.REML)

  return dof * log(2 * π * σ²) + ld + r² / σ²
end

function _setθ!(
  reterms::Vector{<:MixedModels.AbstractReMat},
  parmap::Vector{<:NTuple},
  θ::AbstractVector,
)
  length(θ) == length(parmap) || throw(DimensionMismatch())
  reind = 1
  λ = first(reterms).λ
  for (tv, tr) in zip(θ, parmap)
      tr1 = first(tr)
      if reind ≠ tr1
          reind = tr1
          λ = reterms[tr1].λ
      end
      λ[tr[2], tr[3]] = tv
  end
  return reterms
end

function myupdateL!(A::Vector, L::Vector, reterms::Vector)

  k = length(reterms)
  copyto!(last(L), last(A))  # ensure the fixed-effects:response block is copied
  for j in eachindex(reterms) # pre- and post-multiply by Λ, add I to diagonal
      cj = reterms[j]
      diagind = MixedModels.kp1choose2(j)
      MixedModels.copyscaleinflate!(L[diagind], A[diagind], cj)
      for i = (j+1):(k+1)     # postmultiply column by Λ
          bij = MixedModels.block(i, j)
          MixedModels.rmulΛ!(copyto!(L[bij], A[bij]), cj)
      end
      for jj = 1:(j-1)        # premultiply row by Λ'
          MixedModels.lmulΛ!(cj', L[MixedModels.block(j, jj)])
      end
  end
  for j = 1:(k+1)             # blocked Cholesky
      Ljj = L[MixedModels.kp1choose2(j)]
      for jj = 1:(j-1)
          _rankUpdate!(
              Hermitian(Ljj, :L),
              L[MixedModels.block(j, jj)],
              -one(eltype(Ljj)),
              one(eltype(Ljj)),
          )
      end
      _cholUnblocked!(Ljj, Val{:L})
      LjjT = isa(Ljj, Diagonal) ? Ljj : LowerTriangular(Ljj)
      for i = (j+1):(k+1)
          Lij = L[MixedModels.block(i, j)]
          for jj = 1:(j-1)
              mul!(
                  Lij,
                  L[MixedModels.block(i, jj)],
                  L[MixedModels.block(j, jj)]',
                  -one(eltype(Lij)),
                  one(eltype(Lij)),
              )
          end
          rdiv!(Lij, LjjT')
      end
  end
  return nothing
end

_pwrss(L::Vector) = abs2(last(last(L)))

function _logdet(L::Vector, reterms::Vector{<:MixedModels.AbstractReMat}, REML::Bool)
  @inbounds s = sum(j -> MixedModels.LD(L[MixedModels.kp1choose2(j)]), axes(reterms, 1))
  if REML
      lastL = last(L)
      s += MixedModels.LD(lastL)        # this includes the log of sqrtpwrss
      s -= log(last(lastL)) # so we need to subtract it from the sum
  end
  return s + s  # multiply by 2 b/c the desired det is of the symmetric mat, not the factor
end

LinearAlgebra.copy_oftype(A::MixedModels.UniformBlockDiagonal, ::Type{T}) where {T} =
  MixedModels.UniformBlockDiagonal(LinearAlgebra.copy_oftype(A.data, T))

function LinearAlgebra.copy_oftype(
  A::MixedModels.BlockedSparse{<:Any,S,P},
  ::Type{T},
) where {T,S,P}
  cscmat = LinearAlgebra.copy_oftype(A.cscmat, T)
  nzsasmat = LinearAlgebra.copy_oftype(A.nzsasmat, T)
  MixedModels.BlockedSparse{T,S,P}(cscmat, nzsasmat, A.colblkptr)
end

function LinearAlgebra.copy_oftype(A::MixedModels.ReMat{<:Any,S}, ::Type{T}) where {T,S}
  MixedModels.ReMat{T,S}(
      A.trm,
      A.refs,
      A.levels,
      A.cnames,
      LinearAlgebra.copy_oftype(A.z, T),
      LinearAlgebra.copy_oftype(A.wtz, T),
      LinearAlgebra.copy_oftype(A.λ, T),
      A.inds,
      LinearAlgebra.copy_oftype(A.adjA, T),
      LinearAlgebra.copy_oftype(A.scratch, T),
  )
end

function _cholUnblocked!(A::StridedMatrix, ::Type{Val{:L}})
  cholesky!(Hermitian(A, :L))
  return A
end
function _cholUnblocked!(D::MixedModels.UniformBlockDiagonal, ::Type{T}) where {T}
  Ddat = D.data
  for k in axes(Ddat, 3)
      _cholUnblocked!(view(Ddat, :, :, k), T)
  end
  return D
end
function _cholUnblocked!(A::Diagonal, ::Type{T}) where {T}
  A.diag .= sqrt.(A.diag)
  return A
end
_cholUnblocked!(A::AbstractMatrix, ::Type{T}) where {T} = MixedModels.cholUnblocked!(A, T)

function _rankUpdate!(
  C::LinearAlgebra.HermOrSym{T,UniformBlockDiagonal{T}},
  A::StridedMatrix{T},
  α,
  β,
) where {T}
  Cdat = C.data.data
  LinearAlgebra.require_one_based_indexing(Cdat, A)
  isone(β) || rmul!(Cdat, β)
  blksize = size(Cdat, 1)

  for k in axes(Cdat, 3)
      ioffset = (k - 1) * blksize
      joffset = (k - 1) * blksize
      for i in axes(Cdat, 1), j = 1:i
          iind = ioffset + i
          jind = joffset + j
          AtAij = 0
          for idx in axes(A, 2)
              # because the second multiplicant is from A', swap index order
              AtAij += A[iind, idx] * A[jind, idx]
          end
          Cdat[i, j, k] += α * AtAij
      end
  end

  return C
end

function _rankUpdate!(
  C::LinearAlgebra.HermOrSym{T,UniformBlockDiagonal{T}},
  A::MixedModels.BlockedSparse{T,S},
  α,
  β,
) where {T,S}
  Ac = A.cscmat
  cp = Ac.colptr
  all(==(S), diff(cp)) ||
      throw(ArgumentError("Columns of A must have exactly $S nonzeros"))
  Cdat = C.data.data
  LinearAlgebra.require_one_based_indexing(Ac, Cdat)

  j, k, l = size(Cdat)
  S == j == k && div(Ac.m, S) == l ||
      throw(DimensionMismatch("div(A.cscmat.m, S) ≠ size(C.data.data, 3)"))
  nz = Ac.nzval
  rv = Ac.rowval

  @inbounds for j in axes(Ac, 2)
      nzr = nzrange(Ac, j)
      # BLAS.syr!('L', α, view(nz, nzr), view(Cdat, :, :, div(rv[last(nzr)], S)))
      _x = view(nz, nzr)
      view(Cdat, :, :, div(rv[last(nzr)], S)) .+= α .* _x .* _x'
  end

  return C
end

function _rankUpdate!(
  C::LinearAlgebra.HermOrSym{T,S},
  A::StridedMatrix{T},
  α,
  β,
) where {T,S}
  # BLAS.syrk!(C.uplo, 'N', T(α), A, T(β), C.data)
  C.data .*= β
  C.data .+= α .* A * A'
  return C
end

_rankUpdate!(C::AbstractMatrix, A::AbstractMatrix, α, β) =
  MixedModels.rankUpdate!(C, A, α, β)

Notice that this version of the objective function hasn't concentrated out σ. With this code, I can compute the Hessian which clearly shows the indefiniteness

julia> using ForwardDiff

julia> (objective(saddlepointfit), my_deviance(saddlepointfit, [_θ; saddlepointfit.sigma]))
(-284.3311306944413, -284.33113069939395)

julia>

julia> ForwardDiff.hessian(θσ -> my_deviance(saddlepointfit, θσ), [_θ; saddlepointfit.sigma])
6×6 Matrix{Float64}:
    1.04012         9.41825e-6      0.00227411   0.000198533   1.81058e-8   1469.2
    9.41825e-6      0.58946        -1.33622e-5  -0.0003711     0.000425082  1660.82
    0.00227411     -1.33622e-5      0.860294     0.0357195    -2.86937e-7   1510.92
    0.000198533    -0.0003711       0.0357195    0.288273     -2.35253e-6     -1.50027
    1.81058e-8      0.000425082    -2.86937e-7  -2.35253e-6   -0.051874        1.04004
 1469.2          1660.82         1510.92        -1.50027       1.04004         9.40305e6

Finally, we can use the ForwardDiff compatible code to fit the model with BFGS using ForwardDiff based derivatives

julia> my_fts = [my_fit(ft) for ft in fts];

julia> countmap(round.(exp.(getindex.(coef.(my_fts), 2)), digits=4))
Dict{Float64, Int64} with 1 entry:
  0.9329 => 500

so with BFGS, we never end up at the saddle point (for this dataset). Of course, we don't know if that is the case for other datasets but I wanted to share this observation to start the conversation. Maybe a gradient based optimization should be considered. It will of course require some work since the version I threw together here is rather allocating so can't be used right away but might still be useful as a starting point.

andreasnoack avatar Aug 10 '23 07:08 andreasnoack

@andreasnoack Can you show us versioninfo()?

@dmbates has actually done a fair number of more involved experiments with using the gradient and it never really paid off. That said, we've recently discussed revisiting this and evaluating the Hessian efficiently has several uses.

In my own experience, BOBYQA rarely fails on well-posed problems, but we do support using other derivative-free algorithms implemented in NLopt. For example, if we change to COBYLA, things work:

N = 500
_rng = Random.seed!(Random.default_rng(), 124)
formula = @formula(log(AUCT) ~ trt + seq + per +
                   (trt + 0 | sub) +
                   zerocorr(trt + 0 | row))
contrasts = Dict(:trt => DummyCoding(), :per => DummyCoding(),
                 :sub => Grouping(), :row => Grouping())
fts = @showprogress map(1:N) do _
    data = transform(saddlepointdata,
                     "AUCT" => ByRow(t -> t + randn(_rng)*1e-12) => "AUCT")
    model = LinearMixedModel(formula, data; contrasts)
    model.optsum.optimizer = :LN_COBYLA
    return fit!(model; REML=true)
end

countmap(round.(exp.(getindex.(coef.(fts), 2)), digits=3))

yields

Dict{Float64, Int64} with 1 entry:
  0.933 => 500

palday avatar Aug 10 '23 13:08 palday

We have five test dataset where we could reproduce the saddle point. When running the test script above with those datasets, COBYLA always avoids the saddle point solutions that BOBYQA sometimes terminate at.

It would be interesting to know if this an issue with the implementation of BOBYQA or an issue with the algorithm. cc @stevengj.

@dmbates has actually done a fair number of more involved experiments with using the gradient and it never really paid off.

I'd be curious to read more about this. My expectation would be that it would be beneficial to use the gradient.

That said, we've recently discussed revisiting this and evaluating the Hessian efficiently has several uses.

The reason I had the ForwardDiff compatible deviance code handy is actually that I'm using it for the Satterthwaite approximation for the degrees of freedom. I'd be happy to contribute that here if there is an interest.

andreasnoack avatar Aug 18 '23 06:08 andreasnoack

The reason I had the ForwardDiff compatible deviance code handy is actually that I'm using it for the Satterthwaite approximation for the degrees of freedom. I'd be happy to contribute that here if there is an interest.

@andreasnoack definitely! That's been on my secret todo list for a very long time, but other things were always higher priority.

palday avatar Aug 18 '23 14:08 palday

@andreasnoack it's stale relative to main, but I think this was the most recent effort: https://github.com/JuliaStats/MixedModels.jl/tree/forwarddiff

palday avatar Aug 18 '23 14:08 palday

@andreasnoack see also https://github.com/JuliaStats/MixedModels.jl/issues/312

palday avatar Aug 22 '23 19:08 palday

As I mention in #742, I have been looking at switching MixedModels.jl to the PRIMA.jl version of bobyqa for the constrained, derivative-free optimization. @zaikunzhang, the developer of libprima, spent considerable amounts of time translating Powell's original f77 code to a modularized modern Fortran. Along the way he discovered and fixed a number of bugs.

I will let you know if I am able to apply PRIMA.bobyqa to this problem

dmbates avatar Feb 07 '24 16:02 dmbates

Because of #833, I revisited this issue. I just tried out the PRIMA backend and the issue persists. I.e.

N = 500
_rng = Random.seed!(Random.default_rng(), 124)
fts = map(1:N) do _
    m = LinearMixedModel(
        @formula(log(AUCT) ~
            trt +
            seq +
            per +
            (trt + 0 | sub) +
            zerocorr(trt + 0 | row)
        ),
        transform(
            saddlepointdata,
            "AUCT" => ByRow(t -> t + randn(_rng)*1e-12) => "AUCT"
        );
        contrasts = Dict(
            [
                :trt,
                :per,
                :sub,
                :row,
            ] .=> Ref(DummyCoding())
        )
    )
    m.optsum.backend = :prima
    # m.optsum.optimizer = :cobyla
    m.optsum.optimizer = :bobyqa
    fit!(
        m;
        REML = true
    )
end

countmap(round.(exp.(getindex.(coef.(fts), 2)), digits=3))

gives

julia> countmap(round.(exp.(getindex.(coef.(fts), 2)), digits=3))
Dict{Float64, Int64} with 2 entries:
  0.933 => 191
  0.93  => 309

and the issue goes away with :cobyla. This might be of interest to @zaikunzhang

andreasnoack avatar Jun 17 '25 13:06 andreasnoack

I don't understand the model. If you have random effects for trt by row then these random effects are confounded with the per-observation noise term. It is not surprising to me that estimation is unstable. One result that I got was

julia> m1 = ans
Linear mixed model fit by maximum likelihood
 :(log(AUCT)) ~ 1 + trt + seq + per + (trt + 0 | sub) + zerocorr(trt + 0 | row)
   logLik   -2 logLik     AIC       AICc        BIC    
   163.7786  -327.5571  -303.5571  -302.0571  -262.7792

Variance components:
         Column  Variance   Std.Dev.    Corr.
row      trt: R  0.00004520 0.00672329
         trt: T  0.05359023 0.23149563   .  
sub      trt: R  0.02513834 0.15855076
         trt: T  0.01178772 0.10857125 -0.18
Residual         0.00004303 0.00655984
 Number of obs: 221; levels of grouping factors: 221, 64

  Fixed-effects parameters:
──────────────────────────────────────────────────────
                   Coef.  Std. Error       z  Pr(>|z|)
──────────────────────────────────────────────────────
(Intercept)   4.62829     0.0285346   162.20    <1e-99
trt: T       -0.0697061   0.0344717    -2.02    0.0432
seq: TRTR     0.015153    0.0315155     0.48    0.6306
per: 2       -0.0543228   0.0345276    -1.57    0.1156
per: 3        0.00428555  0.00265393    1.61    0.1064
per: 4       -0.0570323   0.0345231    -1.65    0.0985
──────────────────────────────────────────────────────

julia> m1.θ
5-element Vector{Float64}:
  1.0249175097804113
 35.289831304567535
 24.169914254418064
 -2.9855064857546267
 16.279405622424537

Notice that the elements of θ, which are rarely greater than 1.0, are as large as 35. This is because all of the per-observation noise is being incorporated into the random effects by row. I'm not sure that it is meaningful to use zerocorr here because one of the random effects for each level of row is always zero.

julia> first(ranef(m1))
2×221 Matrix{Float64}:
  0.0       0.00424502  0.0      -0.00443598  0.0015443  …  0.0       -0.0039046   0.0        0.00303718  -0.00310098
 -0.393695  0.0         0.14303   0.0         0.0           0.279237   0.0        -0.0453793  0.0          0.0

I'm sure that SAS PROC MIXED users can convince themselves that this makes sense but I can't.

dmbates avatar Jun 25 '25 13:06 dmbates

The model might not make sense, but you are in a better position to tell the FDA than I am. However, even if the model doesn't make sense, it seems wrong to me that BOBYQA terminates at a saddle point. I've compiled a debug build of nlopt so I might try so see if I can make any sense of the extra output that can be enabled.

andreasnoack avatar Jul 01 '25 15:07 andreasnoack

As you said, the FDA used a SAS PROC MIXED formulation of the model so there is potentially a problem in translating the PROC MIXED formulation to the MixedModels.jl formulation. Even after 30 years of thinking about mixed-effects models there are some PROC MIXED specifications that I just don't understand.

If the model is over-specified, which I believe is the case, then most positions in the parameter space could represent saddle points.

The fact remains that this model confounds part of the zerocorr(0 + trt|row) random effects term with the residual noise term. As far as I can see the model is inestimable.

dmbates avatar Jul 01 '25 15:07 dmbates

I don't think that the by-row random effect is in the SAS model. I didn't see it in my reading of it, but I've pinged @ararslan who's actually worked with SAS before.

palday avatar Jul 01 '25 19:07 palday

There are two separate issues here. Both are interesting but I think it might be worthwhile to consider them separately

  1. Does the model make sense and does it do the same thing as the SAS version
  2. Is it a bug in BOBYQA that it terminates with SUCCESS at a saddle point. Here it doesn't matter what created the objective function.

Regarding 1., then clearly it is overparameterized. I'm shoehorning MixedModels into fitting a model with an R matrix, see the SAS docs, that isn't just $σ²I$ since the SAS formulation has two different sigmas, one for each formulation. The R part is what the REPEATED statement specifies. I can mimic the R behavior in MixedModels by introducing the "row factor" but then the residual variance parameter is redundant. My thinking was that it would be a minor issue with that redundancy as long as you didn't interpret the estimates of those parameters then it should give you the same result as SAS.

Regarding 2., then, as mentioned above, I think we can ignore how the objective function was created. Does it really make sense to return SUCCESS when terminating at a saddle point?

andreasnoack avatar Jul 01 '25 19:07 andreasnoack

  1. I think you need to set the residual standard deviation to make that trick work -- the σ kwarg to fit will handle that.

  2. I think it would be better if (this implementation of) BOBYQA didn't terminate at a saddle point, but I don't know if it's fully avoidable. BOBYQA is doing a quadratic approximation, so it may not "see" the saddle point. The PRIMA version's use of a confidence radius may be better than the formulation used in NLopt, but I haven't thought too deeply about this (I'm writing this quickly while waiting for CI at work to run). That said, I think the quadratic approximation bit is important here -- in an idealized mixed model model, the objective function is quadratic so BOBYQA will do well very well. Of course, neither models nor the data that they are fit to are ideal in real life, so we do need to be able to detect when things break down.

palday avatar Jul 01 '25 19:07 palday

I agree with @palday 's point 1. I'll try fitting the model with the residual standard deviation constrained to 1.0

BTW, if anyone else has tried that please let me know. I have been having difficulty with my vision due to cataracts. Right now they have operated on the right eye but it will be another week until I have the left eye done so I am in a weird half-blurry, half-sharp state w.r.t. my vision. It makes reading kind of challenging.

dmbates avatar Jul 01 '25 21:07 dmbates

The first here is using fit!(model; REML=true, σ=1.0) and the second does not specify σ:

julia> countmap(round.(exp.(getindex.(coef.(fts), 2)); digits=3))
Dict{Float64, Int64} with 1 entry:
  0.928 => 500

julia> countmap(round.(exp.(getindex.(coef.(fts0), 2)); digits=3))
Dict{Float64, Int64} with 2 entries:
  0.933 => 477
  0.93  => 23

ararslan avatar Jul 01 '25 21:07 ararslan

Thanks for looking at that, @ararslan .

I appreciate the insight but I'm still confused. I tried varying the value of the σ argument and it does tend to give the lowest value of the objective when it is fixed at around 0.009, which is where the optimum occurs if the σ argument is not specified. I'm really not sure what the criterion means at this point. I can believe that things go wonky with the definition of the log-likelihood or the REML criterion and there should be some correction term but right now I don't know what that would be.

dmbates avatar Jul 01 '25 21:07 dmbates

I read carefully all the discussions. However --- forgive my ignorance about Julia and the specific application --- I did not even figure out the following basic questions:

  1. What was the objective function?

  2. What were the decision variables?

  3. What were the constraints?

  4. What were the solutions produced by different solvers?

  5. What does the nice curve mean in the original question posted two years ago?

  6. Why was it obvious that the solution produced by BOBYQA was a saddle point rather than a (local) minimum?

From a theoretical point of view, generically, it is almost impossible for a optimization solver --- sloppy or good --- to converge to a "saddle point", as such points are unstable.

Thanks.

zaikunzhang avatar Jul 01 '25 22:07 zaikunzhang

@ararslan I think the value would have to be sufficiently small to not impose a restriction on the model since the components cannot be negative. The estimated sigmas in the output that @dmbates shared are 0.00004520 and 0.05359023 for the R and T specific variances and 0.00004303 for the residual variance so I think the pinned σ should be no greater than

julia> sqrt(0.00004520 + 0.00004303)
0.009393082561119113

to avoid restricting the model. With σ=0.009, I'm getting

julia> countmap(round.(exp.(getindex.(coef.(fts), 2)); digits=3))
Dict{Float64, Int64} with 2 entries:
  0.933 => 492
  0.93  => 8

BOBYQA is doing a quadratic approximation, so it may not "see" the saddle point.

@palday I only have a high level understanding BOBYQA but I have also had this thought. However, if that is the case then it is a severe limitation in the method that should be mentioned in the description of it and suggests that a quasi Newton method might be a better option. We only detected this issue because of an extensive comparison against SAS results so there might be several MixedModels fits that hits this issue.

@zaikunzhang thanks for joining the conversation.

  1. What was the objective function?

It is the restricted maximum likelihood objective function of a mixed effects model after concentrating out the fixed effects parameters. There are some details in https://juliastats.org/MixedModels.jl/stable/optimization/ but the code in the first is a complete example, although in Julia.

  1. What were the decision variables?

The elements of the Cholesky factors of the random effects, i.e. elements of $\Lambda_\theta$ in the link above.

The diagonal elements of the Cholesky factors are restricted to be positive. The off-diagonal elements are unrestricted.

  1. What were the solutions produced by different solvers?

BFGS and COBYLA consistently give 0.9329 (although I recently hit https://github.com/JuliaStats/MixedModels.jl/issues/833 with the latter)

  1. What does the nice curve mean in the original question posted two years ago?

It is the objective function around the saddle point solution as a function of the fifth decision variable.

  1. Why was it obvious that the solution produced by BOBYQA was a saddle point rather than a (local) minimum?

The graph in the figure is clearly concave at zero (the "solution" returned by nlopt and PRIMA) in the fifth coordinate. I also included the Hessian in the first post which clearly isn't positive definite.

andreasnoack avatar Jul 02 '25 11:07 andreasnoack

@zaikunzhang The derivation and interpretation of the objective is described more fully in https://arxiv.org/abs/2505.11674

From the point of view of the optimization problem, the objective is a function of, in this case, a parameter vector of length 5. A summary of the optimization stage for one of these examples is

julia> m.optsum
Initial parameter vector: [1.0, 1.0, 1.0, 0.0, 1.0]
Initial objective value:  38318.58532665327

Backend:                  prima
Optimizer:                cobyla
Lower bounds:             [0.0, 0.0, 0.0, -Inf, 0.0]
rhobeg:                   1.0
rhoend:                   1.0e-6
maxfeval:                 -1

Function evaluations:     1306
xtol_zero_abs:            0.001
ftol_zero_abs:            1.0e-5
Final parameter vector:   [0.3681424453109934, 25.722562383726956, 17.90268794574329, -2.207715770191636, 12.560645490417164]
Final objective value:    -286.16940754656434
Return code:              SMALL_TR_RADIUS

The constraints on this parameter vector are lower bounds on four of the five scalar values, which are constrained to be non-negative.

The representation of the statistical model converts these parameters into small, lower triangular matrices that are the lower Cholesky factors of covariance matrices (which must be positive semi-definite).

julia> m.θ
5-element Vector{Float64}:
  0.3681424453109934
 25.722562383726956
 17.90268794574329
 -2.207715770191636
 12.560645490417164

julia> println.(getproperty.(m.reterms, :λ));
[0.3681424453109934 0.0; 0.0 25.722562383726956]
[17.90268794574329 0.0; -2.207715770191636 12.560645490417164]

For this model the first of these Cholesky factors is diagonal and the second is lower triangular.

The profiled objective, which is negative twice the log-likelihood (or the restricted log-likelihood), is a function of these parameters only. There are other parameters in the statistical model but their optimal values, given these parameters, can be evaluated without iteration. (Exactly how this is done is the subject of the arxiv paper mentioned above.)

The nature of the model means that the starting values, [1.0, 1.0, 1.0, 0.0, 1.0], are chosen so that the Cholesky factors generated from these are identity matrices.

If you wish we can provide the parameter values and the corresponding objective values during the course of the optimization using bobyqa and using cobyla. Evaluation of the objective itself is somewhat involved and our Julia code may not be easily translated into other programming languages. We use Julia's multiple dispatch capabilities extensively and also accelerated BLAS implementations. So even on the same processor family and operating system there may be very slight differences in the value of the objective when, say, different BLAS implementations change the order of operations in the Lapack function dpotrf.

The behind-the-scenes problem we are discussing is in the application of the statistical model in the analysis of pharmacokinetic data for submission to the Food and Drug Administration (FDA) in the United States. Their recommendations on how to perform the analysis uses proprietary software from SAS Institute and their algorithms are not disclosed. It is a difficult situation of trying to reproduce results from other software without a detailed description of how that software works.

Please let us know if there is any further information we can provide for you.

dmbates avatar Jul 02 '25 14:07 dmbates

Thank you @andreasnoack for the explanation. I understand things better now but still not completely.

Let me try. So the solution returned by COBYLA and BFGS is 0.9329, which is considered correct, but BOBYQA returns 0.9301, which is believed to be saddle point. Is this right?

If that is the case, could you check

  1. What are the function values corresponding to 0.9328 and 0.9301?

  2. What is your stopping criteria? Could you try setting the maximal number of function evaluations to 1000*n and the smallest trust region radius to machine epsilon?

Let me repeat: generically, it is impossible for any solver to converge to a saddle point.

zaikunzhang avatar Jul 02 '25 15:07 zaikunzhang

Thank you @andreasnoack for the explanation. I understand things better now but still not completely.

Let me try. So the solution returned by COBYLA and BFGS is 0.9329, which is considered correct, but BOBYQA returns 0.9301, which is believed to be saddle point. Is this right?

If that is the case, could you check

  1. What are the function values corresponding to 0.9328 and 0.9301?

  2. What is your stopping criteria? Could you try setting the maximal number of function evaluations to 1000*n and the smallest trust region radius to machine epsilon?

Let me repeat: generically (it has a particular mathematical meaning), it is impossible for any solver to converge to a saddle point.

zaikunzhang avatar Jul 02 '25 15:07 zaikunzhang

@zaikunzhang I should add that to generate the output

Dict{Float64, Int64} with 2 entries:
  0.9329 => 483
  0.9301 => 17

that shows the two different solutions, I'm adding a tiny random perturbation to the data at each of 500 times I'm running the optimization. Based on those 500 fits, I'm getting

julia> combine(groupby(tmp_df, "xval"), "fval" .=> [mean, var])
2×3 DataFrame
 Row │ xval     fval_mean  fval_var
     │ Float64  Float64    Float64
─────┼─────────────────────────────────
   1 │   0.93    -284.331  2.70211e-11
   2 │   0.933   -286.169  1.05848e-11

which gives you the function values associated with the two termination values.

What is your stopping criteria?

Please see https://github.com/JuliaStats/MixedModels.jl/issues/705#issuecomment-3028183418, which, I believe, should have this information.

Could you try setting the maximal number of function evaluations to 1000*n and the smallest trust region radius to machine epsilon?

I interpret this as setting rhoend. That essentially gives the same result and I'm not reaching that maximum number of iterations

julia> maximum([ft.optsum.feval for ft in fts])
694

Update: Let me share the optimization output for one of each results

Bad

julia> fts[1].optsum
Initial parameter vector: [1.0, 1.0, 1.0, 0.0, 1.0]
Initial objective value:  -44.70893444134606

Backend:                  prima
Optimizer:                bobyqa
Lower bounds:             [0.0, 0.0, 0.0, -Inf, 0.0]
rhobeg:                   1.0
rhoend:                   2.220446049250313e-16
maxfeval:                 -1

Function evaluations:     325
xtol_zero_abs:            0.001
ftol_zero_abs:            1.0e-5
Final parameter vector:   [0.2618306642347577, 27.735563195267034, 17.367158317488414, -2.1765460166731314, 0.0]
Final objective value:    -284.33112518002326
Return code:              SMALL_TR_RADIUS

Good

julia> fts[6].optsum
Initial parameter vector: [1.0, 1.0, 1.0, 0.0, 1.0]
Initial objective value:  -44.708934441224415

Backend:                  prima
Optimizer:                bobyqa
Lower bounds:             [0.0, 0.0, 0.0, -Inf, 0.0]
rhobeg:                   1.0
rhoend:                   2.220446049250313e-16
maxfeval:                 -1

Function evaluations:     583
xtol_zero_abs:            0.001
ftol_zero_abs:            1.0e-5
Final parameter vector:   [0.7466427813690414, 30.138734078056896, 20.96629512954934, -2.5830367079671253, 14.69080363737362]
Final objective value:    -286.1694145110057
Return code:              SMALL_TR_RADIUS

andreasnoack avatar Jul 02 '25 18:07 andreasnoack

FWIW, here are the optimization traces for a good and a bad fit

Image

andreasnoack avatar Jul 02 '25 19:07 andreasnoack

xtol_zero_abs: 0.001 ftol_zero_abs: 1.0e-5

What are these parameters? Are they part of the stopping criteria? They are not native PRIMA parameters. They probably come from the upper-level interface. Could you set them to the values that will not interfere the termination (maybe 0)? Thanks.

BTW, is your objective function smooth? Noisy?

zaikunzhang avatar Jul 02 '25 19:07 zaikunzhang

@palday I can see that you added them in https://github.com/JuliaStats/MixedModels.jl/commit/34899cf6d95abb2df8c961c9de5a6a500351ebff. What do they do? I tried to set them zero and it doesn't seem to make a difference.

andreasnoack avatar Jul 02 '25 19:07 andreasnoack

The MixedModels.jl code can use either the NLopt versions (Julia package is NLopt.jl) or the libprima versions (Julia package is PRIMA.jl) of BOBYQA or COBYLA. The internal structure that holds the parameters for the optimizers represents the union of the parameters for the two different implementations. It is possible that a couple of the NLopt parameters leaked into the .optsum display for the PRIMA version. In other words, the structure contains the optimizer parameters for both implementations. The display should show only those from the PRIMA optimizer when PRIMA is in use but it may accidentally contain a couple of parameters from NLopt.

dmbates avatar Jul 02 '25 20:07 dmbates

Looking at the traces it seems that the big difference is whether x5 gets away from zero, its lower bound.

dmbates avatar Jul 02 '25 20:07 dmbates

The traces also show me something about the first plot (the one that looks like a "bell curve" with upturned edges) that I didn't understand. That curve should not have negative values on the x-axis because of the lower bound. It is not a matter of converging to a saddle point as much as it is a matter of converging on the boundary.

So if the iterations can get away from that boundary around 250 evaluations or so then they proceed to the optimum - otherwise they converge on the boundary

dmbates avatar Jul 02 '25 20:07 dmbates

It is not a matter of converging to a saddle point as much as it is a matter of converging on the boundary.

Ah. I didn't notice ealier that the fifth element was at the boundary in the bad runs. Maybe the combination of being at the boundary and a zero gradient is a problem? @zaikunzhang thoughts on that?

I tried to change the lower bounds to -Inf for the parameters associated with the triangular factor. The positive sign is only a convention and not needed for the result to be valid. Without the lower bounds, I'm getting

julia> countmap(round.(exp.(getindex.(coef.(fts), 2)), digits=3))
Dict{Float64, Int64} with 1 entry:
  0.933 => 500

i.e. not bad solutions. This raises two questions

  1. Is the behavior of BOBYQA expected when at the boundary?
  2. Should MixedModels stop imposing non-negativity constraints on Cholesky factor parameters?

andreasnoack avatar Jul 03 '25 08:07 andreasnoack

  1. Is the behavior of BOBYQA expected when at the boundary?

Things become clearer now. It is due to the bound constraint.

This is not unexpected. Indeed, any method that does not exploit second-order (curvature) information may get trapped by such a "saddle point" when solving bound constraints problems. This is different from the unconstrained case, where convergence to saddle points is generically impossible for a proper algorithm.

This is not unique to BOBYQA. I believe COBYLA and (finite-difference) BFGS can encounter the same problem if your test is sufficiently extensive. Even for algorithms using exact gradients (rather than derivative-free methods), it is not difficult to construct non-pathological examples where the problem occurs often.

zaikunzhang avatar Jul 03 '25 10:07 zaikunzhang