PyTorch-LBFGS

(yet) Another stochastic L-BFGS implementation for interest

mbhynes opened this issue · 0 comments

Hi Michael,

I wanted to share a stochastic first-order optimizer idea and implementation for L-BFGS, in case you might find it interesting or even practical in your research. (Congrats also on your PhD & new role at fb!)

The idea behind the algorithm is, in brief, pretty simple:

  • when optimizing a stochastic empirical loss function, use bootstrap sampling over mini-batches of data to estimate the function value, its variance, and the sampling bias
  • run a first-order solver with a line search that treats the variance estimates from the bootstrapped function/gradient evaluations as numerical uncertainties, so that the line search looks for step sizes that satisfy the Wolfe conditions to within the uncertainty level (see the sketch after this list)
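
To make the two bullets above concrete, here's a rough sketch of what the bootstrap estimates and the noise-relaxed Wolfe test could look like; the function names, arguments, and constants are just placeholders for illustration and aren't the exact logic used in boots:

```python
import numpy as np

def bootstrap_loss_and_grad(per_example_losses, per_example_grads, n_boot=32, rng=None):
    """Bootstrap estimates of the mini-batch loss/gradient and their standard errors.

    per_example_losses: (n,) array of per-example losses
    per_example_grads:  (n, d) array of per-example gradients (flattened parameters)
    All names here are hypothetical; the actual boots implementation may differ.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = per_example_losses.shape[0]
    boot_losses, boot_grads = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample the mini-batch with replacement
        boot_losses.append(per_example_losses[idx].mean())
        boot_grads.append(per_example_grads[idx].mean(axis=0))
    boot_losses = np.array(boot_losses)
    boot_grads = np.stack(boot_grads)
    # Means are the bootstrap point estimates; standard deviations across the
    # bootstrap replicates estimate the uncertainty in f and g.
    return (boot_losses.mean(), boot_losses.std(ddof=1),
            boot_grads.mean(axis=0), boot_grads.std(axis=0, ddof=1))


def wolfe_satisfied_within_uncertainty(f0, g0_dot_p, f_t, gt_dot_p, t, sigma_f, c1=1e-4, c2=0.9):
    """Armijo/curvature checks relaxed by the bootstrapped uncertainty sigma_f.

    p is the search direction and t the trial step size; the sufficient-decrease
    test only has to hold to within the estimated noise level in f.
    (Hypothetical sketch, not the exact boots line-search logic.)
    """
    armijo = f_t <= f0 + c1 * t * g0_dot_p + sigma_f
    curvature = gt_dot_p >= c2 * g0_dot_p
    return armijo and curvature
```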

I implemented the idea above in the following repository using TensorFlow: https://github.com/mbhynes/boots

I admittedly haven't run particularly thorough experiments on this, but I did use a small convnet as a toy problem, and it looks like the bootstrapped L-BFGS algorithm can converge as fast as or faster than Adam up to a point, measured by the decrease in training loss vs. the number of function/gradient evaluations [i.e. data points accessed], e.g. here

There's no formal convergence criterion implemented, and empirically it looks like the optimizer can get within some radius of a minimum, but then just starts oscillating a bit without making substantive progress.

The real practical downside to all this is that neither TensorFlow nor PyTorch is particularly efficient at evaluating per-example gradients... so unfortunately, even if the trace above looks sorta neat, in practice you have to wait quite a while to get it from bootstrapping compared to Adam (which I've just taken as roughly the best off-the-shelf algorithm), since the boots implementation above doesn't have an efficient way of evaluating the Jacobian matrix. (It wouldn't be quite so bad in a map-reduce framework where per-example gradients have to be materialized anyway, mind you...)
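
(For what it's worth, PyTorch's functorch-style torch.func API can vectorize per-example gradient computation; here's a rough sketch of what that looks like. The model, shapes, and loss below are just placeholders, not anything boots actually uses:)

```python
import torch
from torch.func import functional_call, grad, vmap

# Placeholder model and batch; only meant to illustrate the per-example-gradient call.
model = torch.nn.Linear(10, 1)
params = dict(model.named_parameters())

def example_loss(params, x, y):
    # Loss for a single example; vmap adds the batch dimension back.
    pred = functional_call(model, params, (x.unsqueeze(0),)).squeeze(0)
    return torch.nn.functional.mse_loss(pred, y)

# One gradient per example in a single vectorized call, which is what a
# bootstrap over per-example gradients needs.
per_example_grads = vmap(grad(example_loss), in_dims=(None, 0, 0))

x, y = torch.randn(32, 10), torch.randn(32, 1)
grads = per_example_grads(params, x, y)  # dict of tensors, each with a leading batch dim of 32
```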

Anyway, hope it's of interest, and all the best!

mbhynes · Sep 19 '22