
LBFGS iterations depend on whether bounds checking is active

Open goerz opened this issue 2 years ago • 17 comments

I'm using LBFGS to run the optimization at https://nbviewer.org/gist/goerz/785b6052e3e5c69cece991d007be19d7/simple_state_to_state.ipynb

This is the actual call to LBFGS in the underlying GRAPE package:

https://github.com/JuliaQuantumControl/GRAPE.jl/blob/930d1593468a37225f0fe393e19cabc2a38af713/src/optimize.jl#L328-L344

I seem to be observing that Optim.jl behaves differently depending on which flags I'm running Julia with, most notably whether or not I run Julia with --check-bounds="yes"

Without forced bounds checking (normal REPL session), I get the following optimization trace:

* Example 1 (examples/simple_state_to_state.jl):
Iter     Function value   Gradient norm
     0     9.514590e-01     3.910143e-03
 * Current step size: 1.0
 * g(x): [0.003910143312153535, 0.00389185946360387, ... , 0.0038918594636039128, 0.003910143312153574]
 * x: [-2.7755575615628915e-18, 0.0004498836583447907, ... , 0.00044988365834480184, -2.7755575615628915e-18]
     1     9.184734e-01     5.483692e-03
 * Current step size: 3075.044530765617
 * g(x): [0.00548268584831715, 0.005481344487930618, ... , 0.0054813444879306845, 0.0054826858483172135]
 * x: [-12.023864806547483, -11.967191274405144, ... , -11.967191274405275, -12.023864806547603]
     2     9.137205e-01     5.620519e-03
 * Current step size: 4.652119264210941
 * g(x): [0.005601410513859061, 0.005601363978498371, ... , 0.005601363978500167, 0.005601410513860864]
 * x: [-75.60690496614599, -75.31763796966332, ... , -75.31763796966416, -75.60690496614677]
     3     2.023652e-02     2.815586e-03
 * Current step size: 0.373053629733721
 * g(x): [-0.002785392101726243, -0.002785401827233983, ... , -0.0027854018272285785, -0.002785392101720847]
 * x: [-672.710406667093, -670.213106500147, ... , -670.2131065002962, -672.7104066672426]
     4     8.817667e-04     7.593890e-05
 * Current step size: 0.0007512761292748913
 * g(x): [-2.2044995463188464e-5, -2.2057860759390016e-5, ... , -2.2057860750299343e-5, -2.204499545409266e-5]
 * x: [-672.5633407172459, -670.066585460024, ... , -670.0665854601733, -672.5633407173955]

This matches what's in the linked notebook.

With bounds checking (as done by Pkg.test(), but I've verified this manually too):

* Example 1 (examples/simple_state_to_state.jl):
Iter     Function value   Gradient norm
     0     9.514590e-01     3.910143e-03
 * Current step size: 1.0
 * g(x): [0.003910143312153535, 0.00389185946360387, ... , 0.0038918594636039128, 0.003910143312153574]
 * x: [-2.7755575615628915e-18, 0.0004498836583447907, ... , 0.00044988365834480184, -2.7755575615628915e-18]
     1     9.184734e-01     5.483692e-03
 * Current step size: 3075.044530765617
 * g(x): [0.00548268584831715, 0.005481344487930618, ... , 0.0054813444879306845, 0.0054826858483172135]
 * x: [-12.023864806547483, -11.967191274405144, ... , -11.967191274405275, -12.023864806547603]
     2     9.137205e-01     5.620519e-03
 * Current step size: 4.652119264211364
 * g(x): [0.005601410513949036, 0.005601363978588344, ... , 0.00560136397859015, 0.005601410513950853]
 * x: [-75.60690496615172, -75.31763796966902, ... , -75.31763796966987, -75.60690496615248]
     3     4.957191e-01     9.990670e-03
 * Current step size: 0.3393703816763038
 * g(x): [0.00994589491037895, 0.009945890807204606, ... , 0.009945890807271162, 0.009945894910445499]
 * x: [-618.7975542338356, -616.4996187769607, ... , -616.4996187770973, -618.7975542339726]
     4     6.670505e-03     2.582736e-04
 * Current step size: 15.390536155598872
 * g(x): [0.0001957685549398441, 0.00019576332074427603, ... , 0.00019576332080496658, 0.00019576855500053343]
 * x: [-618.9506268890536, -616.6526913690286, ... , -616.6526913691663, -618.9506268891917]
     5     6.554008e-03     4.258734e-05
 * Current step size: 0.64412568931522
 * g(x): [-1.993721801600513e-5, -1.9942440553845517e-5, ... , -1.9942440493064428e-5, -1.9937217955225417e-5]
 * x: [-618.9526067305096, -616.654671156316, ... , -616.6546711564544, -618.9526067306483]
     6     1.461014e-04     1.693921e-05
 * Current step size: 2197.444286400128
 * g(x): [-1.3432561495974644e-5, -1.343340128733725e-5, ... , -1.3433401273156338e-5, -1.3432561481789187e-5]
 * x: [-618.5542774831223, -616.2562249894997, ... , -616.2562249909987, -618.5542774846216]

I can reproduce the above two traces on my MacBook as well as on a Linux server.

On GitHub CI (with Pkg.test(), i.e., bounds checking) I'm getting yet another result:

* Example 1 (examples/simple_state_to_state.jl):
Iter     Function value   Gradient norm
     0     9.514590e-01     3.910143e-03
 * Current step size: 1.0
 * time: 0.028926849365234375
 * g(x): [0.003910143312153538, 0.0038918594636038724, ... , 0.0038918594636039136, 0.003910143312153576]
 * x: [-2.7755575615628915e-18, 0.0004498836583447907, ... , 0.00044988365834480184, -2.7755575615628915e-18]
     1     9.184734e-01     5.483692e-03
 * Current step size: 3075.044530765618
 * time: 2.1887259483337402
 * g(x): [0.005482685848317463, 0.00548134448793093, ... , 0.005481344487930989, 0.005482685848317516]
 * x: [-12.023864806547497, -11.967191274405156, ... , -11.967191274405282, -12.023864806547614]
     2     9.137205e-01     5.620519e-03
 * Current step size: 4.6521192642314135
 * time: 2.2652180194854736
 * g(x): [0.005601410518216981, 0.005601363982856158, ... , 0.0056013639828579025, 0.005601410518218734]
 * x: [-75.60690496642678, -75.31763796994308, ... , -75.31763796994387, -75.60690496642752]
     3     9.137205e-01     5.620519e-03
 * Current step size: 0.0
 * time: 2.3340020179748535
 * g(x): [0.005601410518216981, 0.005601363982856158, ... , 0.0056013639828579025, 0.005601410518218734]
 * x: [-75.60690496642678, -75.31763796994308, ... , -75.31763796994387, -75.60690496642752]

This one is particularly bad, since the step size goes to zero, causing the optimization to fail.

It seems somewhat likely that the exceedingly large step size (3075!) in the second iteration is causing problems here. The linesearch clearly goes off the rails. Incidentally, is there a way to prevent this? A different linesearch algorithm, as in https://github.com/JuliaNLSolvers/Optim.jl/issues/922? Can I limit α to "reasonable" values (the update in each iteration shouldn't be larger than the value itself)? The solution that LBFGS finds here (see the linked notebook) is technically correct, but the problem can be solved with control parameters of much smaller amplitude; see the solution of the same optimization problem with a different method at https://qucontrol.github.io/krotov/v1.2.1/notebooks/01_example_simple_state_to_state.html.

In any case, though, even in such a pathological case of the linesearch going crazy, I find it a bit worrying that Optim isn't giving me reproducible results. I would expect to see the exact same numbers independent of any flags I might pass to Julia.

goerz avatar Oct 17 '21 06:10 goerz

I should probably add that the notebook is for illustration only. The actual code that produces the traces I've posted above is from the tests / example of the GRAPE package (which I converted to the notebook with Literate.jl and annotated with some extra comments).

To reproduce, you'd have to dev-install all the packages in the JuliaQuantumControl org, then go to the GRAPE.jl folder and run make test there (bounds checking), or make devrepl followed by include("test/runtests.jl") (no bounds checking).

goerz avatar Oct 17 '21 14:10 goerz

There are a couple of different things here. First, that step size is indeed very large, but I cannot exclude that it is the smallest step size that produces a step satisfying the Wolfe conditions. You could try backtracking instead.
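
For reference, a minimal sketch of what switching to a backtracking linesearch could look like, assuming the optimizer object is forwarded to Optim as in the GRAPE call linked above:

using Optim, LineSearches

# Backtracking only ever shrinks the trial step, so it cannot overshoot
# the way a bracketing linesearch that expands the interval can.
optimizer = Optim.LBFGS(linesearch=LineSearches.BackTracking())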

Wrt the bounds checking, that is indeed... interesting. I don't remember whether turning bounds checking on/off can result in different numerical results if, for example, some compiler optimizations become possible/impossible. If it was an indexing mistake, you'd think the forced bounds checks would throw an error, wouldn't you?

Since it's taking such a long step I'd guess the approximation is ill-conditioned. Even small changes in vectorization (simd, etc) can make this go in one direction or the other...

pkofod avatar Oct 20 '21 17:10 pkofod

There are a couple of different things here.

Right. I did a bit of further analysis, looking at the gradient, the LBFGS search direction (P⁻¹∇J_T), and how the functional J_T (that's my f) behaves when moving by a distance α in those two directions.

For the default parameters, this looks as follows for the first two iterations:

[figure: linesearch analysis for the first two iterations (default parameters)]

The gradient looks good, but, as observed previously, the linesearch goes way overboard. In fact, going in the direction of the gradient by α=100 or so would almost solve the problem (J_T is between 1 and 0, and I want to go to 0). The value α=3075 is just a (pretty unfortunate) local minimum.

This is issue number 1 (not a bug in Optim.jl, just something I'm trying to get a handle on)

In the second iteration, the LBFGS search direction, i.e., the gradient scaled by the inverse of the estimated Hessian P⁻¹, is suddenly 4 orders of magnitude larger than the gradient itself. I find this a bit unexpected, but I don't know enough about the internals of LBFGS to know if this makes sense. This is issue number 2.

As a bit of background, the last time I did optimal quantum control with LBFGS was a good 10 years ago, using the L-BFGS-B Fortran implementation. I've probably forgotten some of the details since then, and that library has fewer knobs to turn (if any; the linesearch is fixed, as I recall, whatever algorithm that is). I don't remember it being overly fiddly, and it was pretty comparable to the optimization method I normally use (Krotov's method). I've treated LBFGS really as a black box so far, thinking of it as gradient descent with some sort of second-order correction "for free".

First, that very large step size is indeed very large, but I cannot exclude that it is indeed the smallest step size that produces a step that satisfies the wolfe conditions. You could try backtracking instead.

Possible; I never actually tried to understand what the Wolfe conditions do. I should probably look into that ;-)

I do get better solutions with either backtracking, or limiting the linesearch with something like

alphaguess=LineSearches.InitialStatic(alpha=0.2)
linesearch=LineSearches.HagerZhang(alphamax=2.0)

I'm still seeing my "issues 1 and 2", though. For the backtracking, these are four "interesting" iterations:

[figure: linesearch analysis with backtracking (1 of 2)]

[figure: linesearch analysis with backtracking (2 of 2)]

In everything before iteration 47, the gradient and the LBFGS direction are basically identical, and the linesearch looks good, progressing slowly towards the minimum. Then iteration 48 suddenly again has a four-order-of-magnitude jump in the LBFGS search direction (P⁻¹), and the linesearch overshoots, although not as dramatically. In the subsequent iterations 49 and 50 (and the remainder of the optimization, which I stop when J_T goes below 10⁻⁴), the LBFGS search direction remains large compared to the gradient, but at least the linesearch looks much more reasonable (and the resulting update to the control field, top panels, is pretty small).

Wrt the bounds checking, that is indeed... interesting. I don't remember whether turning bounds checking on/off can result in different numerical results if, for example, some compiler optimizations become possible/impossible. If it was an indexing mistake, you'd think the forced bounds checks would throw an error, wouldn't you?

I would! :-) I really have no idea what might be going on here, apart from floating point rounding errors being amplified. So this is still "issue 3", but practically speaking it's not the end of the world. Generally, I like to have my numerical results be reproducible, and not depend on "compiler flags" or such things. On the other hand, ultimately the only thing that matters is finding a good solution to the control problem.

Maybe/Hopefully this behavior will go away if I get a handle on my "issues 1/2". Also, the example I'm doing here is actually a rather trivial control problem (it can be solved analytically), which might make the numerics more unstable. I should try with some less trivial problems. I'll report back on whether I'm continuing to see the issue with more well-behaved optimizations.

Since it's taking such a long step I'd guess the approximation is ill-conditioned. Even small changes in vectorization (simd, etc) can make this go in one direction or the other...

Right. Can you elaborate on that? As I said, I've mainly treated LBFGS as a black box, so I'm not sure when something is ill-conditioned. I've seen "preconditioning" mentioned in the Optim.jl docs. Is that something I should look at?

One thing that I remember doing in Fortran was multiplying the gradient that I feed to L-BFGS-B by some scaling factor (that's because the magnitude of the control parameters depends on the choice of physical units, so it seems reasonable to compensate for that so that typical gradients are of a reasonable magnitude). I haven't played around with that yet in the Julia code. Do you think it would be sufficient to solve my issues?
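
For illustration, a minimal, hypothetical sketch of such a scaling with Optim's API; the toy objective and the factor "scale" are made up, and with a Wolfe linesearch it is safer to scale the objective together with the gradient:

using Optim

# Toy stand-in for the control functional J_T and its gradient
# (in the real problem these come from the GRAPE propagation).
f(x) = 0.5 * sum(abs2, x)
grad(x) = x

# Hypothetical scaling factor compensating for the choice of physical units.
scale = 1e-3
f_scaled(x) = scale * f(x)
g_scaled!(G, x) = (G .= scale .* grad(x))

x0 = fill(0.5, 10)
res = optimize(f_scaled, g_scaled!, x0, LBFGS())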

Anyway... I'm going a bit off the rails here, so don't feel obligated to engage with my lack of understanding of LBFGS' inner workings. The only thing that's an actual technical issue is the non-reproducibility depending on "compiler flags", and it's a minor one.

That being said, if you have quick answers to any of the following questions based on your experience with LBFGS / actually having implemented the method, the help would be greatly appreciated!

  • Is there any easy trick to making the linesearch robust against overshooting?
  • Is the sudden increase by several orders of magnitude in the LBFGS search direction expected?
  • Do you know what the linesearch method in the L-BFGS-B Fortran code is? I might want to emulate that in Julia.
  • Do you know if the "B" part of L-BFGS-B, i.e. putting bounds on the control parameters, is more or less equivalent to the bounded optimizations in Optim.jl? (pending #927)

goerz avatar Oct 26 '21 04:10 goerz

For comparison, this is how L-BFGS-B performs:

[figure: linesearch analysis with L-BFGS-B]

Much more well-behaved! (and no linesearch parameters to tune, which is a good thing)

goerz avatar Oct 30 '21 18:10 goerz

Interesting. LBFGSB.jl wraps the original Fortran code, so that is certainly a very good comparison. You provide a lot of information, which is good, but I probably have to read it again to answer your questions :)

I don't find it impossible that there might be a bug somewhere, or a less relaxed approach to some numerical issue on our side. It may also simply be that L-BFGS-B provides an upper bound on how large the step can become and treats it as a valid bracket if it actually goes that far in the quest to satisfy the Wolfe conditions. The code is public and with an open license, so I guess I might have to have a look at some point.

The information you provide is very useful. So thank you for taking your time.

Is there any easy trick to making the linesearch robust against overshooting?

A bound on the step size. Not everything we have supports it, I think.

* Is the sudden increase by several orders of magnitude in the LBFGS search direction expected?

I think it's because of the very large step?

* Do you know what the linesearch method in the L-BFGS-B Fortran code is? I might want to emulate that in Julia

I think it's MoreThuente, which we also have, but I'm not 100% sure.
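
For reference, a minimal sketch of selecting it, assuming the optimizer object is forwarded to Optim:

using Optim, LineSearches

# More-Thuente linesearch (Moré & Thuente, 1994), which may be the same
# algorithm the L-BFGS-B Fortran code uses.
optimizer = Optim.LBFGS(linesearch=LineSearches.MoreThuente())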

* Do you know if the "B" part of L-BFGS-B, i.e. putting bounds on the control parameters, is more or less equivalent to the bounded optimizations in `Optim.jl`?

It's not. The one in Optim is worse.

pkofod avatar Oct 31 '21 21:10 pkofod

For the default parameters, this looks as follows for the first two iterations:

Can you show me a script so I can generate that alpha=3000 iteration? Thanks

pkofod avatar Oct 31 '21 23:10 pkofod

For the default parameters, this looks as follows for the first two iterations:

Can you show me a script so I can generate that alpha=3000 iteration? Thanks

Sure! Here it is: https://gist.github.com/goerz/f9ef5228bbe23a9358a60e91f961d95d#file-simple_state_to_state-jl

This is adapted from https://github.com/JuliaQuantumControl/GRAPE.jl/blob/master/examples/simple_state_to_state.jl

By switching between lines 83 and 84, you can switch between LBFGSB (LBFGSB.jl) and Optim's LBFGS. You could also pass in another Optim.LBFGS optimizer with different (linesearch) parameters.

I've put the relevant Pkg lines in the script so that it uses the exact current revisions for the packages in JuliaQuantumControl, so as long as you run the script in its own environment (see the comment at the top of the file), it should run through. It will generate images linesearch-001.png etc. (via GRAPELinesearchAnalysis.jl, called in line 80) in the same folder. This uses matplotlib/PyCall, which I've found sometimes needs to be installed "manually" to work properly.

goerz avatar Nov 01 '21 01:11 goerz

Much more well-behaved! (and no linesearch parameters to tune, which is a good thing)

I do wonder... Am I reading this correctly, that L-BFGS-B also takes a big step? 175-ish. I know it's not 3000, but...

pkofod avatar Nov 02 '21 20:11 pkofod

Also, can you maybe help me understand the first panel? What is "guess" and "opt"?

pkofod avatar Nov 02 '21 20:11 pkofod

It’s the control values: “opt” is the output x in the current iteration, and “guess” is the x_previous, i.e., the input control values.

For this application, the control values are the values of a function ϵ(t) at the midpoints of a time grid (tlist). So the label “time step” for the x-axis of the top three panels is really the index of the component of the x/gradient/searchdirection vector. Might have been better to plot this with markers or even a bar plot to emphasize that these are really vectors, not continuous functions. Does that make sense?

goerz avatar Nov 02 '21 20:11 goerz

Didn't you use opt in both cases? :)

Anyway, I just tried MoreThuente() from LineSearches as I suggested above. I'm pretty sure that's what they use in LBFGSB. In that case, it converges nicely in six iterations.

I see you've changed to LBFGSB.jl, which is totally fine, but if I could have you try out LBFGS with MoreThuente, I'd be happy to learn more from your runs. I appreciate that you've spent so much time already, and I am sorry for any time you may have wasted on this as well.

pkofod avatar Nov 02 '21 20:11 pkofod

Didn't you use opt in both cases? :)

Yes, typo, sorry, fixed :-)

The LBFGSB paper references J. J. Moré and D. J. Thuente, ACM Trans. Math. Softw. 20, 286 (1994) for the linesearch, which I'm assuming matches the MoreThuente in Optim.jl. I will do a comparison between LBFGSB and Optim.LBFGS with MoreThuente linesearch when I have a chance.

goerz avatar Nov 02 '21 21:11 goerz

Yes, that is our implementation of the same. As I noted, it did actually converge quickly for the example you posted, but there are never any good guarantees with these things. It might fail for another example :)

Edit: and MoreThuente should be relatively similar to HagerZhang, so maybe there's an issue with the latter. I have reimplemented it in NLSolvers.jl; it could also be worth trying that, but it's not as polished yet.

pkofod avatar Nov 03 '21 06:11 pkofod

@goerz sorry for spamming you, but I wanted to look at it, and I cannot get it to produce the same run that you posted. The one with 40-50 iterations and one of them being very weird. Is it supposed to happen just by running that script with no modifications?

pkofod avatar Nov 03 '21 20:11 pkofod

If you mean the second example from https://github.com/JuliaNLSolvers/Optim.jl/issues/953#issuecomment-951543625, where I've included the plot for iterations 47-50, that looks like it would have been

opt_result = optimize_grape(
        problem,
        info_hook=chain_infohooks(
            GRAPELinesearchAnalysis.plot_linesearch(@__DIR__),
            GRAPE.print_table,
        ),
        optimizer=Optim.LBFGS(
            alphaguess=LineSearches.InitialStatic(alpha=0.2),
            linesearch=LineSearches.HagerZhang(alphamax=2.0)
        ),
);

i.e., changing the initial and maximum α. Although when I rerun that with the example script, I'm now getting convergence in 27 iterations, and the "jump" where the LBFGS search direction suddenly increases by 4 orders of magnitude is at iteration 25. I'm not sure if there was some other parameter that I didn't keep track of that accounts for the change in the number of iterations, or if it's because I've switched computers from an Intel to an M1 processor since the original post, or if it's due to some floating-point instability. Still, qualitatively it's the same behavior.

Incidentally, I also just tried Optim's More-Thuente linesearch, with

opt_result = optimize_grape(
        problem,
        info_hook=chain_infohooks(
            GRAPELinesearchAnalysis.plot_linesearch(@__DIR__),
            GRAPE.print_table,
        ),
        optimizer=Optim.LBFGS(linesearch=MoreThuente()),
);

and that gives the following result:

[figure: optimization with Optim's MoreThuente linesearch]

which is extremely good! Compared with L-BFGS-B in https://github.com/JuliaNLSolvers/Optim.jl/issues/953#issuecomment-955577686, it's actually even slightly better: L-BFGS-B overshoots slightly in the 3rd iteration and then corrects in the 4th, whereas Optim's More-Thuente goes directly to the local minimum in iteration 3.

goerz avatar Nov 03 '21 22:11 goerz

I think HagerZhang is a bit too aggressive, as it increases the bracket by a factor of 5 if the approximate Wolfe conditions are not satisfied (https://github.com/JuliaNLSolvers/LineSearches.jl/blob/deb5db05d5aeaa97a218daba963ac3f50422c8c0/src/hagerzhang.jl#L87). MoreThuente sets it to 4, but it could be between 1.1 and 4 according to theory (https://github.com/JuliaNLSolvers/LineSearches.jl/blob/deb5db05d5aeaa97a218daba963ac3f50422c8c0/src/morethuente.jl#L222).

HagerZhang exposes this, so you could try to set rho lower than 5 (say, 2), or I could modify MoreThuente to have the delta parameter exposed and set it maybe a bit lower. The methods are set up this way because, if the initial interval doesn't contain a step that satisfies the strong Wolfe conditions, we want to rapidly find an interval that does instead of spending time taking tiny steps. However, I fear that in many problems it can be problematic to step 5 times along the search direction, because you can easily get into numerically problematic territory (see your wobbly linesearch graphs, for example).
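
A minimal sketch of lowering rho, again assuming the optimizer object is forwarded to Optim:

using Optim, LineSearches

# Expand the linesearch bracket by a factor of 2 instead of the default 5
# whenever the (approximate) Wolfe conditions are not yet satisfied.
optimizer = Optim.LBFGS(linesearch=LineSearches.HagerZhang(rho=2.0))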

Thanks for bringing all of this to my attention.

pkofod avatar Nov 04 '21 08:11 pkofod

For your example I get

* Candidate solution
    Final objective value:     -5.773160e-15

 * Found with
    Algorithm:     L-BFGS

 * Convergence measures
    |x - x'|               = 2.77e-06 ≰ 0.0e+00
    |x - x'|/|x'|          = 4.48e-09 ≰ 0.0e+00
    |f(x) - f(x')|         = 5.62e-12 ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 9.73e+02 ≰ 0.0e+00
    |g(x)|                 = 5.80e-12 ≤ 1.0e-08

 * Work counters
    Seconds run:   11  (vs limit Inf)
    Iterations:    9
    f(x) calls:    41
    ∇f(x) calls:   41

with no changes to LBFGS and with HagerZhang(rho=2.0) I get

 * Status: success

 * Candidate solution
    Final objective value:     -7.549517e-15

 * Found with
    Algorithm:     L-BFGS

 * Convergence measures
    |x - x'|               = 3.12e-07 ≰ 0.0e+00
    |x - x'|/|x'|          = 3.96e-07 ≰ 0.0e+00
    |f(x) - f(x')|         = 6.20e-13 ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 8.21e+01 ≰ 0.0e+00
    |g(x)|                 = 8.51e-14 ≤ 1.0e-08

 * Work counters
    Seconds run:   8  (vs limit Inf)
    Iterations:    6
    f(x) calls:    22
    ∇f(x) calls:   22

Just one data point of course, but...

pkofod avatar Nov 04 '21 08:11 pkofod