
HZ: avoid convergence if bisected

Open timholy opened this issue 1 year ago • 26 comments

secant2 calls update, and update might switch to bisection in the U3 step (as named in the initial HZ publication). If we did bisect, the line-search "are we making enough progress?" check (step L2) is nonsensical: the bracket may have shrunk over multiple bisection iterations, but that is not an indication that the secant model was "working".

Consequently, this PR reports back whether bisection was engaged in update, and if so skips any kind of convergence assessment and does another iteration.
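To make the mechanism concrete, here is a minimal sketch (with made-up names; this is not the actual LineSearches.jl internals): the bracket-update step reports whether it fell back to bisection, and the caller counts interval shrinkage as evidence of secant progress only when no bisection happened.

struct Bracket
    a::Float64
    b::Float64
end

width(br::Bracket) = br.b - br.a

# Hypothetical convergence assessment: if the U3 bisection branch fired, skip the
# L2 "enough progress?" test and request another iteration; otherwise apply the
# usual shrinkage criterion (gamma defaults to 0.66 in HZ).
function made_enough_progress(old::Bracket, new::Bracket, bisected::Bool; gamma=0.66)
    bisected && return false
    return width(new) <= gamma * width(old)
end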

Fixes #173. Sadly there isn't a very portable test.

timholy avatar Jan 28 '24 19:01 timholy

Thanks @timholy, really appreciate it. I'll try to find time to review it later today.

pkofod avatar Jan 29 '24 09:01 pkofod

Thank you! This really turned out to be more involved than I initially expected.

mateuszbaran avatar Jan 29 '24 10:01 mateuszbaran

Bump?

mateuszbaran avatar Feb 08 '24 12:02 mateuszbaran

Okay, this looks good to me. @mateuszbaran did you somehow check if this solved your original problem?

pkofod avatar Feb 12 '24 11:02 pkofod

I've just checked. We've solved the original problem by making L-BFGS in Manopt more robust to line searches returning stepsize 0, so it's harder to reproduce the true original problem, but I can still generate problems for which the sub-optimal alpha is selected in the "It's so flat, secant didn't do anything useful, time to quit" branch.

mateuszbaran avatar Feb 12 '24 12:02 mateuszbaran

This script can easily generate cases where such choice is made:

using Revise
using Manopt
using Manifolds
using LineSearches
using LinearAlgebra

norm_inf(M::AbstractManifold, p, X) = norm(X, Inf)

function f_rosenbrock(x)
    result = 0.0
    for i in 1:2:length(x)
        result += (1.0 - x[i])^2 + 100.0 * (x[i + 1] - x[i]^2)^2
    end
    return result
end
function f_rosenbrock(::AbstractManifold, x)
    return f_rosenbrock(x)
end

function g_rosenbrock!(storage, x)
    for i in 1:2:length(x)
        storage[i] = -2.0 * (1.0 - x[i]) - 400.0 * (x[i + 1] - x[i]^2) * x[i]
        storage[i + 1] = 200.0 * (x[i + 1] - x[i]^2)
    end
    return storage
end

function g_rosenbrock!(M::AbstractManifold, storage, x)
    g_rosenbrock!(storage, x)
    riemannian_gradient!(M, storage, x, storage)
    return storage
end

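# Repeatedly run Riemannian L-BFGS (with a random memory length) on the Rosenbrock
# objective restricted to spheres of increasing dimension, using the LineSearches.jl
# Hager-Zhang line search as the stepsize rule; some runs hit the suboptimal flat-exit branch.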
function test_case_manopt()
    for i in 2:2:1000
        N = 5000+i
        mem_len = rand(4:10)
        println("i = $i, mem_len=$mem_len")
        M = Manifolds.Sphere(N - 1; parameter=:field)
        ls_hz = LineSearches.HagerZhang()

        x0 = zeros(N)
        x0[1] = 1
        manopt_sc = StopWhenGradientNormLess(1e-6; norm=norm_inf) | StopAfterIteration(1000)

        quasi_Newton(
            M,
            f_rosenbrock,
            g_rosenbrock!,
            x0;
            stepsize=Manopt.LineSearchesStepsize(ls_hz),
            #stepsize=HagerZhangLinesearch(M),
            evaluation=InplaceEvaluation(),
            vector_transport_method=ParallelTransport(),
            return_state=true,
            memory_size=mem_len,
            stopping_criterion=manopt_sc,
        )
    end
end
test_case_manopt()

You can then add a check like values[ib] < values[iA] || values[iB] < values[iA] in HZ to see whether the choice is indeed suboptimal.
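For example, a small diagnostic along these lines (hypothetical, not part of LineSearches.jl) could be dropped into that branch:

# Hypothetical helper: true when the value about to be returned (at index iA) is
# beaten by a value already evaluated at one of the other bracketing indices.
suboptimal_return(values, ib, iA, iB) = values[ib] < values[iA] || values[iB] < values[iA]

# Inside the flatness branch one could then do, e.g.:
#     suboptimal_return(values, ib, iA, iB) && @warn "HZ returned a suboptimal alpha"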

mateuszbaran avatar Feb 12 '24 12:02 mateuszbaran

I've just checked. We've solved the original problem by making L-BFGS in Manopt more robust to line searches returning stepsize 0, so it's harder to reproduce the true original problem, but I can still generate problems for which the sub-optimal alpha is selected in the "It's so flat, secant didn't do anything useful, time to quit" branch.

Have you decided to reset the approximation in that case?

pkofod avatar Feb 12 '24 14:02 pkofod

Currently it just goes in the direction of the negative gradient, but the memory isn't purged. Maybe it would work better with purging the memory; I just didn't check how to do that safely in Manopt, and it works OK without it.
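For context, the fallback is roughly this (a simplified sketch, not the actual Manopt code):

# Simplified sketch: if the line search returned a zero stepsize, fall back to the
# negative-gradient direction; the quasi-Newton memory is left untouched.
robust_direction(qn_direction, gradient, stepsize) =
    stepsize == 0 ? -gradient : qn_direction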

mateuszbaran avatar Feb 12 '24 14:02 mateuszbaran

Currently it just goes in the direction of the negative gradient, but the memory isn't purged. Maybe it would work better with purging the memory; I just didn't check how to do that safely in Manopt, and it works OK without it.

Okay. I had thought to change Optim to reset in this case, which is why I asked :)

pkofod avatar Feb 12 '24 16:02 pkofod

I had a chat about it with Roland Herzog some time ago and he suggested using truncated CG in such a case, but restarting is also a valid option.

mateuszbaran avatar Feb 12 '24 16:02 mateuszbaran

@timholy do you have any thoughts on the above comments by @mateuszbaran ?

pkofod avatar Feb 14 '24 19:02 pkofod

I've checked the paper and it doesn't have a termination condition corresponding to the "It's so flat, secant didn't do anything useful, time to quit" branch present in the LineSearches.jl implementation. I don't see what the justification for it is. Maybe it should be modified?

mateuszbaran avatar Feb 27 '24 13:02 mateuszbaran

Bump?

mateuszbaran avatar Mar 23 '24 13:03 mateuszbaran

I've checked the paper and it doesn't have a termination condition corresponding to the "It's so flat, secant didn't do anything useful, time to quit" branch present in the LineSearches.jl implementation. I don't see what the justification for it is. Maybe it should be modified?

I think Tim may have added that. I suppose you're saying that maybe it's failing here partially because the branch doesn't belong?

pkofod avatar Mar 23 '24 14:03 pkofod

There is clearly a difference between how it's commented and what it actually does. It's possible that the termination condition actually catches some corner case and is a good idea in some situations, but currently it also catches things it shouldn't (including in this PR). I don't know what corner case it was supposed to handle, so I don't know how to properly fix it.

Anyway, there is always the simple change I've made that works well enough for me and shouldn't really harm any use case.

There is already at least one person (other than me and Ronny) who had issues because my fixed HZ linesearch is what works best for his problem, but he would like to use it in Pluto, which doesn't like non-registered packages. So it would be nice to have a good HZ linesearch in a registered package.

mateuszbaran avatar Mar 23 '24 15:03 mateuszbaran

Sorry I haven't had time for "general"/"charitable" development lately. My busy schedule is likely to persist for at least another couple of weeks. If anyone wants to take this effort over, by all means go for it.

A couple of high-level thoughts:

  • the Julia implementation of the HZ linesearch should be in this package, and it should be good. It's important to fix this and release a new version.
  • this episode is a reminder that in some ways, the test suite is more valuable than the implementation. My extensions to the published algorithm were almost certainly motivated by real-world cases, but those are lost to the mists of time. My inability to reconstruct those motivations is at risk of paralyzing any decision here. If push comes to shove, it might be better to revert to the published algorithm and force me to deal with the consequences.
  • Likewise, the issues uncovered by @mateuszbaran depend on a complicated software stack that may be more than one wants to depend on for a test suite: updates to a large ecosystem have a way of making such things fragile. So we're at risk of losing yet more important test cases if we don't do something.
  • To me, it seems that the most important step is to develop a testing layer where we capture the outcome of the one-dimensional problem, as that removes dependence on the "real" problem and allows us to create small tests without a large dependency stack. The idea would be to grab the record of the phi and dphi values at the alpha values chosen by the algorithm, and use interpolation to match both the values and slopes at those alpha values. (IIRC Lagrange polynomial interpolation can be readily generalized to match both values and derivatives.) Then we could have a test suite consisting of vectors of alpha, phi, and dphi values for all the "bad" cases anyone reports.
  • Failing to do this now, while we have real-world failing test cases, is just a way of perpetuating our inability to keep this package at the forefront of Julia implementations of line search methods. Which would be a giant waste.

Thus if someone wants to help move this forward, perhaps the best thing to do is focus on developing that test harness.

timholy avatar Mar 25 '24 14:03 timholy

Thank you, your observations are very on-point. To push this forward I can prepare a simplified test case, either this week or the next one. Anyway, I'm not a maintainer of this package so I can't decide how to fix this issue.

mateuszbaran avatar Mar 25 '24 15:03 mateuszbaran

I have made a self-contained example based on cubic Hermite interpolation. It somehow doesn't reproduce the original problem -- I will look for the right test values, but that's all I could do in the time I've found so far.

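# A recorded line-search trace: the step lengths tried, with the objective value and slope at each one.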
struct LineSearchTestCase
    alphas::Vector{Float64}
    values::Vector{Float64}
    slopes::Vector{Float64}
end


function prepare_test_case(; alphas, values, slopes)
    perm = sortperm(alphas)
    alphas = alphas[perm]
    push!(alphas, alphas[end]+1)
    values = values[perm]
    push!(values, values[end])
    slopes = slopes[perm]
    push!(slopes, 0.0)
    return LineSearchTestCase(alphas, values, slopes)
end

tc1 = prepare_test_case(;
    alphas = [0.0, 1.0, 5.0, 3.541670844449739],
    values = [3003.592409634743, 2962.0378569864743, 2891.4462095232184, 3000.9760725116876],
    slopes = [-22332.321416890798, -20423.214551925797, 11718.185026267562, -22286.821227217057]
)

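# Build the objective ϕ(α) from the recorded trace by piecewise cubic Hermite interpolation.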
function tc_to_f(tc)
    function f(x)
        i = findfirst(u -> u > x, tc.alphas) - 1
        xk = tc.alphas[i]
        xkp1 = tc.alphas[i + 1]
        dx = xkp1 - xk
        t = (x - xk) / dx
        h00t = 2t^3 - 3t^2 + 1
        h10t = t * (1 - t)^2
        h01t = t^2 * (3 - 2t)
        h11t = t^2 * (t - 1)
        val =
            h00t * tc.values[i] +
            h10t * dx * tc.slopes[i] +
            h01t * tc.values[i + 1] +
            h11t * dx * tc.slopes[i + 1]

        return val
    end
end

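# Same interpolant, but returning both the value ϕ(α) and the slope ϕ'(α).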
function tc_to_fdf(tc)
    function fdf(x)
        i = findfirst(u -> u > x, tc.alphas) - 1
        xk = tc.alphas[i]
        xkp1 = tc.alphas[i + 1]
        dx = xkp1 - xk
        t = (x - xk) / dx
        h00t = 2t^3 - 3t^2 + 1
        h10t = t * (1 - t)^2
        h01t = t^2 * (3 - 2t)
        h11t = t^2 * (t - 1)
        val =
            h00t * tc.values[i] +
            h10t * dx * tc.slopes[i] +
            h01t * tc.values[i + 1] +
            h11t * dx * tc.slopes[i + 1]

        h00tp = 6t^2 - 6t
        h10tp = 3t^2 - 4t + 1
        h01tp = -6t^2 + 6 * t
        h11tp = 3t^2 - 2t
        slope =
            (
                h00tp * tc.values[i] +
                h10tp * dx * tc.slopes[i] +
                h01tp * tc.values[i + 1] +
                h11tp * dx * tc.slopes[i + 1]
            ) / dx
        println(x, " ", val, " ", slope)
        return val, slope
    end
end

using LineSearches

function test_tc(tc)
    hz = HagerZhang()
    f = tc_to_f(tc)
    fdf = tc_to_fdf(tc)
    hz(f, fdf, 1.0, fdf(0.0)...)
end

test_tc(tc1)

EDIT: new testing code, where the best value found during the search is 2891.4462095232184, but HZ returns the point at which the objective equals 3000.9760725116876.

mateuszbaran avatar Apr 08 '24 17:04 mateuszbaran

I've updated my example; now there is a clear, fully self-contained case of a sub-optimal returned value.

mateuszbaran avatar Apr 09 '24 08:04 mateuszbaran

Thank you so much Mateusz! I hope to have a little more time next week than in the past months :)

pkofod avatar Apr 12 '24 09:04 pkofod

Not forgotten! I will use your test-case to look at this next week.

pkofod avatar Apr 18 '24 20:04 pkofod

I've updated my example; now there is a clear, fully self-contained case of a sub-optimal returned value.

You have been patient :) I had some deadlines that were quite important. I fixed some Optim.jl stuff yesterday evening, and I will try to look at this when I'm off work.

pkofod avatar Apr 30 '24 06:04 pkofod

As per https://github.com/JuliaNLSolvers/LineSearches.jl/issues/175, I encountered another issue where the flatness check causes early termination at a point that is not close to stationary, which does not seem to be resolved by this PR.

The problematic sequence of c, ϕ, and dϕ data is this:

cs = [0, 0.2, 0.1, 0.055223623837016025]
ϕs = [3.042968312396456, 3.117411287254072, -3.503584823341612, 0.5244246783084083]
dϕs = [-832.4270136930788, -505.33622492291033, 674.947830358913, 738.3388472434362]

This bug took a while for me to diagnose (typical users will not suspect a problem in LineSearches). Maybe as a band-aid we can add a check inside that flatness branch: if the norm of the gradient does not seem small, print a warning for the user that there is a known bug in this part of LineSearches?
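A rough sketch of such a band-aid (hypothetical names, not actual LineSearches.jl code): before the flatness branch returns, check whether the slope at the candidate point is actually close to zero, and warn if it is not.

# Hypothetical guard for the flatness branch: warn when the directional derivative
# at the point being returned is clearly nonzero, i.e. the point is not close to
# stationary.
function warn_if_not_stationary(dphi_c; tol=1e-6)
    if abs(dphi_c) > tol
        @warn "Hager-Zhang flatness check terminated at a non-stationary point (known issue, see LineSearches.jl #174 and #175)" dphi_c
    end
    return nothing
end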

My original function is oscillatory, but as @timholy says above, one can just as well create a polynomial giving the same data:

The idea would be to grab the record of the phi and dphi values at the alpha values chosen by the algorithm, and use interpolation to match both the values and slopes at those alpha values. (IIRC Lagrange polynomial interpolation can be readily generalized to match both values and derivatives.)

I created the package HermiteInterpolation for exactly this purpose (submitted for registration), and it can be used like this:

using HermiteInterpolation: fit
f = fit(cs, ϕs, dϕs)
@assert f.(cs) ≈ ϕs

using DynamicPolynomials
@polyvar c

println(f(c))
# 3.042968312396456 - 832.4270136930788*c - 132807.15591801773*c^2 + 7.915421661743959e6*c^3 - 1.570284840040962e8*c^4 + 1.4221708747294645e9*c^5 - 5.970065591205604e9*c^6 + 9.405512899903618e9*c^7

println(differentiate(f(c), c))
# -832.4270136930788 - 265614.31183603546*c + 2.3746264985231876e7*c^2 - 6.281139360163848e8*c^3 + 7.110854373647323e9*c^4 - 3.582039354723362e10*c^5 + 6.5838590299325325e10*c^6

Incidentally, as a temporary workaround for my optimization problem, I came up with this sequence:

    method = Optim.LBFGS(; linesearch=Optim.LineSearches.BackTracking(order=2))
    res0 = Optim.optimize(f, g!, collect(reinterpret(Float64, params)), method, options)
    res = Optim.optimize(f, g!, Optim.minimizer(res0), Optim.ConjugateGradient(), options)

LBFGS with BackTracking line search seems to converge robustly, and once it's near the local minimum, ConjugateGradient can effectively fine-tune the minimizer.

kbarros avatar May 14 '24 14:05 kbarros

Here is an optimization problem that highlights a fundamental problem with the current flatness check:

# The minimizer is x0=[0, 2πn/100], with f(x0) = 1. Any integer n is fine.
function f(x)
    return (x[1]^2 + 1) * (2 - cos(100*x[2]))
end

using Optim

function test_converges(method)
    for i in 1:100
        r = randn(2)
        res = optimize(f, r, method)
        if Optim.converged(res) && minimum(res) > f([0,0]) + 1e-8
            println("""
                Incorrectly reported convergence after $(res.iterations) iterations
                Reached x = $(Optim.minimizer(res)) with f(x) = $(minimum(res))
                """)
        end
    end
end

# Works successfully, no printed output
test_converges(LBFGS(; linesearch=Optim.LineSearches.BackTracking(order=2)))

# Prints ~10 failures to converge (in 100 tries). Frequently fails after the
# first line search.
test_converges(ConjugateGradient())

kbarros avatar May 18 '24 16:05 kbarros

Gentle ping -- any hope for tightening up the HZ "flatness check"? The randomized optimization problem above should work pretty robustly as a unit test.

kbarros avatar Aug 20 '24 01:08 kbarros

I'll get to this in another week or two.

timholy avatar Aug 29 '24 14:08 timholy

Here is an optimization problem that highlights a fundamental problem with the current flatness check:

I think this may highlight the same issue as in https://github.com/JuliaNLSolvers/LineSearches.jl/pull/174#issuecomment-2043311432

Here is an optimization problem that highlights a fundamental problem with the current flatness check:

With my own simple branch made for checking this, I am able to run

julia> test_converges(ConjugateGradient(; linesearch=Optim.LineSearches.HagerZhang(check_flatness=false)))

julia> test_converges(ConjugateGradient(; linesearch=Optim.LineSearches.HagerZhang(check_flatness=true)))
Incorrectly reported convergence after 1 iterations
Reached x = [0.16013274827601717, 0.35097409719940265] with f(x) = 2.9310451880340986

Incorrectly reported convergence after 1 iterations
Reached x = [0.0507085888328239, 0.1084483256815114] with f(x) = 2.1557003490556643

Incorrectly reported convergence after 2 iterations
Reached x = [0.7211228462970422, -22.3471576683635] with f(x) = 3.8050518641095663

Incorrectly reported convergence after 1 iterations
Reached x = [-1.2589697390692955, -1.0403076610453794] with f(x) = 7.5909348717323635

Incorrectly reported convergence after 1 iterations
Reached x = [0.08409600277401064, 0.10725179410260108] with f(x) = 2.2831453152517023

Incorrectly reported convergence after 2 iterations
Reached x = [0.9010705241444945, 0.7330566636590027] with f(x) = 4.526934892232607

Incorrectly reported convergence after 2 iterations
Reached x = [-0.27240989049521974, 11.156588470976192] with f(x) = 3.1405412374583666

Incorrectly reported convergence after 2 iterations
Reached x = [0.4717492439353791, -15.742921266415511] with f(x) = 3.5917484672484252

Incorrectly reported convergence after 1 iterations
Reached x = [-1.4143709712675014, 0.3441070259740454] with f(x) = 8.969056311635578

Incorrectly reported convergence after 1 iterations
Reached x = [0.29458492668827274, 1.1477725568576247] with f(x) = 2.291918955272849

Incorrectly reported convergence after 1 iterations
Reached x = [-0.19214192623076573, -0.14209344321079873] with f(x) = 2.1486141359671405

Incorrectly reported convergence after 4 iterations
Reached x = [-0.21665091042117182, 43.27319699842477] with f(x) = 2.3266172361393616

Incorrectly reported convergence after 1 iterations
Reached x = [-0.6789087745560297, 0.5133014956972761] with f(x) = 2.213588359372836

and running @mateuszbaran's problem here https://github.com/JuliaNLSolvers/LineSearches.jl/pull/174#issuecomment-2043311432 I get

julia> res, res_cache = test_tc(tc1, true)
0.0 3003.592409634743 -22332.321416890798
1.0 2962.0378569864743 -20423.214551925797
5.0 2891.4462095232184 11718.185026267562
3.541670844449739 3000.9760725116876 -22286.821227217057
((3.541670844449739, 3000.9760725116876), LineSearchCache{Float64}([0.0, 1.0, 5.0, 3.541670844449739], [3003.592409634743, 2962.0378569864743, 2891.4462095232184, 3000.9760725116876], [-22332.321416890798, -20423.214551925797, 11718.185026267562, -22286.821227217057]))

julia> @test_broken minimum(res_cache.values) == res[2]

Test Broken
  Expression: minimum(res_cache.values) == res[2]

julia> res2, res_cache2 = test_tc(tc1, false)
0.0 3003.592409634743 -22332.321416890798
1.0 2962.0378569864743 -20423.214551925797
5.0 2891.4462095232184 11718.185026267562
3.541670844449739 3000.9760725116876 -22286.821227217057
4.49745720537289 -2137.605778336924 7059.04612357025
((4.49745720537289, -2137.605778336924), LineSearchCache{Float64}([0.0, 1.0, 5.0, 3.541670844449739, 4.49745720537289], [3003.592409634743, 2962.0378569864743, 2891.4462095232184, 3000.9760725116876, -2137.605778336924], [-22332.321416890798, -20423.214551925797, 11718.185026267562, -22286.821227217057, 7059.04612357025]))

julia> @test minimum(res_cache2.values) == res2[2]
Test Passed

with this plot of the problematic case

[plot of ϕ(α) for the problematic test case, with the evaluated points marked as red dots]

So based on the plot and the order of alphas above, we first check α = 1, which has a downward slope, then α = 5, which has an upward one. We additionally come up with α = 3.54..., which has a downward slope, and then we get a bracket [3.54..., 5.0] that must contain a minimum. But nothing here suggests that the function should be flat between 3.54... and 5.0. The flatness check seems to test the unfortunate condition that

nextfloat(3000.9760725116876) >=  2891.4462095232184

in other words, A = 3.54... and B = 5.0, so B > A as suggested in the assertion, but why would we know that

nextfloat(values[ia]) >= values[ib] && nextfloat(values[iA]) >= values[iB]

is then an appropriate test for flatness when nextfloat(values[iA]) >= values[iB] holds trivially? It seems to expect that values[iA] == values[iB] or values[iA] < values[iB], or am I misunderstanding something in this flatness check?
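Plugging the numbers from the bracket above into the quoted condition shows why it fires spuriously here:

values_iA = 3000.9760725116876   # value at A = 3.541670844449739
values_iB = 2891.4462095232184   # value at B = 5.0

# The second clause holds trivially because values[iA] already exceeds values[iB],
# not because the function is flat on [A, B]:
nextfloat(values_iA) >= values_iB   # true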

pkofod avatar Mar 26 '25 05:03 pkofod

Yes, I think it's making some assumptions about what those indices mean. It's basically expecting that it narrowed the interval and is still within epsilon of flat. We could probably add some additional sanity checks and maybe it would happen less? But your red dots, if they correspond to ia, ib, iA, and iB, indicate that it's incorrect about the meaning of those indices.

timholy avatar Apr 09 '25 16:04 timholy

I think (ia, ib, iA, iB) = (0, 1, 3.54..., 5.0) above but I have to go back and print :)

pkofod avatar Apr 09 '25 17:04 pkofod

I tried modifying the "flatness test" to check closeness rather than inequality and I got

julia> test_tc(tc1)
0.0 3003.592409634743 -22332.321416890798
1.0 2962.0378569864743 -20423.214551925797
(ia, ib) = (1, 2)
5.0 2891.4462095232184 11718.185026267562
3.541670844449739 3000.9760725116876 -22286.821227217057
4.49745720537289 -2137.605778336924 7059.04612357025
(4.49745720537289, -2137.605778336924)

which seems to be a much better point for @mateuszbaran's problem!
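For reference, the idea is roughly this (a sketch; the details in my branch may differ): only declare the bracket flat when the two values agree to within a relative tolerance, rather than relying on the one-sided nextfloat inequality.

# Sketch of a "closeness"-based flatness test.
flat_bracket(vA, vB; rtol=sqrt(eps(Float64))) = isapprox(vA, vB; rtol=rtol)

flat_bracket(3000.9760725116876, 2891.4462095232184)  # false, so no early flat exit here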

pkofod avatar Apr 10 '25 14:04 pkofod