
ENH: Learning rate schedulers for ADVI optimizers


Before

No response

After

import pytensor
import pytensor.tensor as pt
from functools import wraps

def step_lr_scheduler(optimizer, update_every, gamma=0.1):
    # optimizer is a functools.partial, so we can update the keyword arguments in place
    # by mutating the .keywords dictionary
    kwargs = optimizer.keywords
    
    # Replace the provided learning_rate with a shared variable
    shared_lr = pytensor.shared(kwargs['learning_rate'], 'learning_rate')
    
    # Set partial function keyword argument to the new shared variable
    kwargs['learning_rate'] = shared_lr
    
    @wraps(optimizer)
    def scheduled_optimizer(loss_or_grads, params, *args, **kwargs):
        # Get the updates dictionary from optimizer
        updates = optimizer(loss_or_grads, params, *args, **kwargs)
        
        # The last update for all the optimizers is the timestep (is this always true?)
        # We need to use that shared time variable to do our lr update (so everything stays in sync)
        t = list(updates.values())[-1]
        
        # Here's the actual learning rate update rule
        new_lr = pt.switch(
            pt.eq(pt.mod(t, update_every), 0),
            shared_lr * gamma,
            shared_lr,
        )

        # Add the learning rate update to the updates dictionary
        updates[shared_lr] = new_lr
        return updates
    
    # Return the wrapped optimizer function
    return scheduled_optimizer
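
For reference, here's roughly how I imagine the wrapper being used; a minimal sketch assuming the existing pm.adam optimizer partial and the pm.fit(obj_optimizer=...) entry point, with a toy model just for illustration:

import pymc as pm

with pm.Model() as model:
    mu = pm.Normal("mu", 0.0, 1.0)
    pm.Normal("obs", mu, 1.0, observed=[0.1, -0.3, 0.2])

    # pm.adam(...) called without loss/params returns a functools.partial,
    # which is exactly what step_lr_scheduler mutates
    opt = step_lr_scheduler(pm.adam(learning_rate=1e-2), update_every=1_000, gamma=0.5)

    # The wrapped optimizer plugs straight into ADVI
    approx = pm.fit(n=10_000, obj_optimizer=opt)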

Context for the issue:

This has come up twice on the Discourse, and I have a partially working solution already, so I'm mostly opening this to take the temperature of the room and see if people think this is a worthy feature.

Learning rate scheduling is important for training large deep learning models. Basically, instead of setting a fixed learning rate, you pick something high, then automatically anneal it down as training progresses according to some rule. There are all sorts of wacky schedules to choose from; see the PyTorch docs for examples. I don't know of any research into using this on Bayesian models (disclaimer: I didn't look), but I don't see any reason why it wouldn't also be helpful in this context, especially for large/complex models. At the bare minimum it would be interesting for training Bayesian NNs built with PyMC, a sexy but under-supported capability of the library.
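
Just to make "some rule" concrete, the simplest case is a step schedule, which is roughly what the wrapper above implements symbolically; written as plain Python it's nothing more than this (illustrative only, not tied to any PyMC API):

def step_schedule(lr0, step, every=1_000, gamma=0.1):
    # Multiply the base learning rate by gamma once per `every` steps
    return lr0 * gamma ** (step // every)

# e.g. with lr0=0.01: 0.01 for steps 0-999, 0.001 for 1000-1999, 0.0001 after that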

It's not actually too hard to do given our current setup: it just requires a wrapper around an optimizer that injects some extra updates into the updates dictionary all the optimizers return. See the example code above.

Grabbing the existing shared variables we want out of the updates dictionary is pretty ugly. One tweak that would need to be made to the existing code base is to name all the shared variables that get passed around. Then we could write a function that retrieves them by name. It would make dprints a bit nicer, too.
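
If the shared variables were named, the retrieval helper could be quite small. A hypothetical sketch (get_shared_by_name is not an existing function, and it assumes the optimizers name their learning rate variable "learning_rate"):

from pytensor.graph.basic import graph_inputs

def get_shared_by_name(updates, name):
    # Walk the inputs of all update expressions and return the first
    # variable carrying the requested name
    for var in graph_inputs(list(updates.values())):
        if var.name == name:
            return var
    raise KeyError(f"No variable named {name!r} found in the updates graph")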

One other nice feature of naming things would be to allow for composable schedulers. Each wrapper could check for the existence of a learning_rate shared variable in the updates dictionary first, and overwrite it to include additional update computations. So one could combine, say, a geometrically decreasing learning rate with cosine annealing for some kind of "dampened cycle".
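
A rough sketch of what that composition could look like, reusing the hypothetical get_shared_by_name helper from above (modify_lr would be any callable mapping the current symbolic learning rate to a new expression, e.g. a cosine factor):

from functools import wraps

def compose_schedule(optimizer, modify_lr):
    @wraps(optimizer)
    def wrapped(loss_or_grads, params, *args, **kwargs):
        updates = optimizer(loss_or_grads, params, *args, **kwargs)
        lr = get_shared_by_name(updates, "learning_rate")
        # Chain onto whatever update an earlier scheduler already registered,
        # or start from the current value if none exists yet
        updates[lr] = modify_lr(updates.get(lr, lr))
        return updates
    return wrapped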

Thoughts?

jessegrabowski · Oct 13 '23 19:10