ENH: Learning rate schedulers for ADVI optimizers
Before
No response
After
import pytensor
import pytensor.tensor as pt
from functools import wraps


def step_lr_scheduler(optimizer, update_every, gamma=0.1):
    # optimizer is a functools.partial, so we can update the keyword arguments in place
    # by mutating the .keywords dictionary
    kwargs = optimizer.keywords

    # Replace the provided learning_rate with a shared variable
    shared_lr = pytensor.shared(kwargs['learning_rate'], 'learning_rate')

    # Set the partial function's keyword argument to the new shared variable
    kwargs['learning_rate'] = shared_lr

    @wraps(optimizer)
    def scheduled_optimizer(loss_or_grads, params, *args, **kwargs):
        # Get the updates dictionary from the optimizer
        updates = optimizer(loss_or_grads, params, *args, **kwargs)

        # The last update for all the optimizers is the timestep (is this always true?)
        # we need to use the timestep's update here so our lr update stays in sync
        t = updates[list(updates.keys())[-1]]

        # Here's the actual learning rate update: decay by gamma every update_every steps
        new_lr = pt.switch(pt.eq(pt.mod(t, update_every), 0),
                           shared_lr * gamma,
                           shared_lr)

        # Add the learning rate update to the updates dictionary
        updates[shared_lr] = new_lr
        return updates

    # Return the wrapped optimizer partial function
    return scheduled_optimizer
Context for the issue:
This has come up twice on Discourse, and I already have a partially working solution, so I'm mostly opening this to take the temperature of the room and see if people think this is a worthy feature.
Learning rate scheduling is important for training large deep learning models. Basically, instead of setting a fixed learning rate, you pick something high, then automatically anneal it down as training progresses, according to some rule. There are all sorts of wacky schedules to choose from; see the Pytorch docs for examples. I don't know of any research into using this on Bayesian models (disclaimer: I didn't look), but I don't see any reason why it wouldn't also be helpful in this context, especially for large/complex models. At the bare minimum it would be interesting for training Bayesian NNs built with PyMC, a sexy but under-supported capability of the library.
It's not actually too hard to do given our current setup; it just requires a wrapper around an optimizer that injects some extra updates into the updates dictionary all the optimizers return. See the example code above.
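As a quick usage sketch (assuming the step_lr_scheduler above; pm.adam and the obj_optimizer keyword of pm.fit are the existing API), the wrapped optimizer should drop straight into ADVI:

import pymc as pm

# Decay the learning rate by a factor of 10 every 1000 steps
scheduled_adam = step_lr_scheduler(pm.adam(learning_rate=1e-2), update_every=1000, gamma=0.1)

with pm.Model():
    mu = pm.Normal("mu", 0.0, 1.0)
    pm.Normal("obs", mu, 1.0, observed=[0.1, -0.3, 0.2])

    # obj_optimizer just needs to be a (loss_or_grads, params) -> updates callable,
    # so the wrapped partial slots in like any of the built-in optimizers
    approx = pm.fit(n=5000, method="advi", obj_optimizer=scheduled_adam)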
Grabbing the existing shared variables we want out of the updates dictionary is pretty ugly. One tweak that would need to be made to the existing code base is to name all the shared variables that get passed around. Then we could write a function that retrieves them by name. It would make dprints a bit nicer, too.
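For concreteness, the kind of helper I have in mind (get_shared_update is hypothetical, and it assumes the optimizers actually name their shared variables, which they currently don't):

def get_shared_update(updates, name):
    # Look up a shared variable's update expression by the variable's name,
    # instead of relying on the position of keys in the updates dictionary
    for shared_var, update_expr in updates.items():
        if shared_var.name == name:
            return update_expr
    raise KeyError(f"no shared variable named {name!r} in the updates dictionary")

# e.g. in step_lr_scheduler above, if adam named its timestep "t":
# t = get_shared_update(updates, "t")   # instead of updates[list(updates.keys())[-1]]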
One other nice feature of naming things would be that it allows for composable schedulers. Each wrapper could check for the existence of a learning_rate shared variable in the updates dictionary first, and overwrite it to include additional update computations. So one could combine, say, a geometrically decreasing learning rate with cosine annealing for some kind of "dampened cycle".
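A sketch of that composition pattern, layering a simple per-step geometric decay on top of the step scheduler above (cosine annealing would follow the same pattern; geometric_decay_on_top is hypothetical):

from functools import wraps

def geometric_decay_on_top(optimizer, decay=0.9995):
    @wraps(optimizer)
    def scheduled_optimizer(loss_or_grads, params, *args, **kwargs):
        updates = optimizer(loss_or_grads, params, *args, **kwargs)
        # Check for a named learning_rate shared variable that an earlier wrapper
        # (like step_lr_scheduler above) already put into the updates dictionary...
        shared_lr = next(
            var for var in updates if getattr(var, "name", None) == "learning_rate"
        )
        # ...and layer the extra decay on top of whatever update it already has,
        # rather than overwriting it
        updates[shared_lr] = updates[shared_lr] * decay
        return updates

    return scheduled_optimizer

# e.g. a step schedule combined with a slow geometric decay:
# opt = geometric_decay_on_top(step_lr_scheduler(pm.adam(learning_rate=1e-2), update_every=1000))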
Thoughts?