(yet) Another stochastic L-BFGS implementation for interest
Hi Michael,
I wanted to share a stochastic first-order optimizer idea and implementation for L-BFGS, in case you might find it interesting or even practical in your research (congrats also on your PhD & new role at FB!).
The idea behind the algorithm is pretty simple in brief:
- when optimizing a stochastic empirical loss function, use bootstrap sampling over mini-batches of data to estimate the function value, its variance, and the sampling bias
- run a first-order solver using a line search that treats the variance estimates from the bootstrapped function/gradient evaluations as numerical uncertainties, so that the line search looks for step sizes that satisfy the Wolfe conditions within the uncertainty level (see the sketch below)
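To make those two bullets a bit more concrete, here's a rough sketch of the two ingredients in plain numpy; the function names and the exact way the uncertainties get folded into the Wolfe conditions are simplified for illustration and aren't lifted from the repo:

```python
# Rough sketch (for illustration only, not the actual boots code):
#  1) bootstrap estimates of the mini-batch loss/gradient and their spread,
#  2) a Wolfe check that is relaxed by the estimated loss uncertainty.
import numpy as np

def bootstrap_estimates(per_example_losses, per_example_grads, n_boot=100, rng=None):
    """Estimate mean/std of the mini-batch loss and gradient by resampling examples."""
    rng = rng or np.random.default_rng()
    n = len(per_example_losses)
    loss_samples, grad_samples = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample examples with replacement
        loss_samples.append(per_example_losses[idx].mean())
        grad_samples.append(per_example_grads[idx].mean(axis=0))
    loss_samples = np.array(loss_samples)
    grad_samples = np.array(grad_samples)
    return (loss_samples.mean(), loss_samples.std(),
            grad_samples.mean(axis=0), grad_samples.std(axis=0))

def wolfe_within_uncertainty(f0, f0_err, f_t, f_t_err, g0, g_t, d, t, c1=1e-4, c2=0.9):
    """Accept step size t if the Wolfe conditions hold up to the loss uncertainty."""
    tol = f0_err + f_t_err                        # combined uncertainty on the decrease
    sufficient_decrease = f_t <= f0 + c1 * t * g0.dot(d) + tol
    curvature = g_t.dot(d) >= c2 * g0.dot(d)      # ordinary curvature condition
    return sufficient_decrease and curvature
```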
I implemented the idea above in the following repository using tensorflow: https://github.com/mbhynes/boots
I admittedly haven't run particularly thorough experiments on this, but I did use a small convnet as a toy problem, and it looks like the bootstrapped L-BFGS algorithm can converge as well as or better than ADAM up to a point, measured by the decrease in training loss vs. the number of function/gradient evaluations (i.e. data points accessed), e.g. here
There's no formal convergence criterion implemented, and empirically it looks like the optimizer can get within a radius of a minimum, but then just starts oscillating a bit without making substantive progress.
The real practical downside to all this is that neither tensorflow nor pytorch is particularly efficient at evaluating per-example gradients... so unfortunately, even if the trace above looks sorta neat, in reality you have to wait quite a while to get it from bootstrapping compared to ADAM (which I've just taken as the ~best off-the-shelf algo), since the boots implementation above doesn't have an efficient way of evaluating the Jacobian matrix. (It wouldn't be quite so bad in a map-reduce framework where per-example gradients have to be materialized, mind you...)
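Just to show where the cost comes from, the per-example gradients that the bootstrap needs end up being computed one backward pass per example; a naive sketch of that pattern in pytorch is below (the model/loss names are placeholders, and this isn't code from either repo):

```python
# Naive per-example gradients: one backward pass per example, so a bootstrap over
# a mini-batch costs roughly batch_size backward passes instead of one.
import torch

def per_example_grads(model, loss_fn, xs, ys):
    """Return a (batch_size, n_params) tensor of flattened per-example gradients."""
    grads = []
    for x, y in zip(xs, ys):
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))  # single-example forward
        loss.backward()                                         # single-example backward
        grads.append(torch.cat([p.grad.flatten().clone() for p in model.parameters()]))
    return torch.stack(grads)
```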
Anyway, hope it's of interest, and all the best!