
Added ReduceLROnPlateau callback for VI.


This has been discussed in the forums and in this issue.

There are strong parallels between PyMC's variational inference and training neural networks, where the choice of optimiser and learning rate has a huge impact on training quality and speed. A common technique for training neural networks is to use a learning rate scheduler, which starts with a high learning rate for fast early progress and reduces it over successive epochs, where you want to be more precise, giving faster overall convergence.

Currently, in PyMC, you need to specify a single learning rate that is used for the whole fit. Too large and it won't converge; too small and it will be too slow. Training once with a large-ish learning rate and then taking the result of that round as the starting point for another round with a smaller learning rate is neither trivial nor elegant.

To address this, I've implemented a new callback for pymc.variational, following Keras' ReduceLROnPlateau. It monitors the model loss at every iteration of VI.fit() and reduces the learning rate by a user-specified factor whenever the loss has not improved for a user-specified number of iterations.
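For intuition, here is a minimal sketch of the plateau logic (illustrative only, not the exact code in this PR). It assumes the learning rate lives in a pytensor shared variable, and uses the (approx, loss_hist, i) call signature that existing callbacks such as CheckParametersConvergence receive:

import numpy as np

class ReduceLROnPlateauSketch:
    """Illustrative sketch; parameter names and defaults are made up."""

    def __init__(self, lr_shared, factor=0.5, patience=100, min_lr=1e-8):
        self.lr_shared = lr_shared  # pytensor shared variable holding the learning rate
        self.factor = factor        # multiply the learning rate by this on a plateau
        self.patience = patience    # iterations without improvement before reducing
        self.min_lr = min_lr
        self.best = np.inf
        self.wait = 0

    def __call__(self, approx, loss_hist, i):
        loss = loss_hist[-1]
        if loss < self.best:
            self.best, self.wait = loss, 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                new_lr = max(self.lr_shared.get_value() * self.factor, self.min_lr)
                self.lr_shared.set_value(new_lr)
                self.wait = 0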

Major / Breaking Changes

None (I think!)

New features

Added a ReduceLROnPlateau callback.

Bugfixes

None.

Documentation

None so far, but it would be nice to add an example.

Maintenance

None.


:books: Documentation preview :books:: https://pymc--7011.org.readthedocs.build/en/7011/

alvaropp commented on Nov 14 '23


@jessegrabowski raised a good point that it's not ideal to require the user to define a shared variable, so I've reworked my callback: it now takes a vanilla PyMC optimiser instance and modifies its learning rate directly:

with pm.Model() as model:
    [...]

    optimiser = pm.adam(learning_rate=0.1)
    fit = pm.fit(
        obj_optimizer=optimiser,
        callbacks=[
            ReduceLROnPlateau(optimiser=optimiser)
        ],
    )

alvaropp commented on Nov 14 '23

Codecov Report

Merging #7011 (d375bf0) into main (ad450a6) will increase coverage by 1.46%. Report is 20 commits behind head on main. The diff coverage is 80.68%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #7011      +/-   ##
==========================================
+ Coverage   87.66%   89.12%   +1.46%     
==========================================
  Files         101      101              
  Lines       16964    16982      +18     
==========================================
+ Hits        14871    15136     +265     
+ Misses       2093     1846     -247     
Files                            Coverage                      Δ
pymc/variational/opvi.py         74.17% <100.00%>  (-13.22%) :arrow_down:
pymc/variational/updates.py      91.74% <98.86%>   (-0.37%)  :arrow_down:
pymc/variational/callbacks.py    55.55% <23.21%>   (-40.68%) :arrow_down:

... and 42 files with indirect coverage changes

codecov[bot] commented on Nov 15 '23

Is it possible to add an example somewhere? Maybe here: https://www.pymc.io/projects/examples/en/latest/gallery.html#variational-inference

alvaropp commented on Nov 15 '23

> I still prefer the Torch-style API to the callback API, but this PR is great so I'll let it go. The tests need to be a bit more robust, though. I also want to keep advocating for a slightly more general solution that will allow for 1) different learning rate adjustment strategies, and 2) composition of learning rate strategies.
>
> I'm not saying this PR needs to implement every single learning rate scheduler, but it would be nice to have a LearningRateStrategy base class that could be extended in the future, and then subclass ReduceLROnPlateau from that.

I've implemented an ExponentialDecay callback and refactored things a bit.

[plots: plateau, expdecay, expdecay_staircase]

alvaropp commented on Nov 16 '23

Hi @jessegrabowski, any concrete ideas for taking this PR forward?

As you said, it would be nice to infer the optimiser from the code rather than pass it to the callback, but I don't know how to go about that.

with pm.Model() as model:
    [...]

    optimiser = pm.adam(learning_rate=0.1)
    fit = pm.fit(
        obj_optimizer=optimiser,
        callbacks=[
            ReduceLROnPlateau(optimiser=optimiser)
        ],
    )

alvaropp commented on Nov 22 '23

@alvaropp sorry for letting this get stale, I am going to do a careful review now/this weekend so you can get back on track and merged in ASAP

jessegrabowski commented on Dec 01 '23

I'm going over the code carefully, and I don't think the callbacks as written do anything. There is a misunderstanding about how compiled pytensor functions work.

The implementation is very nice, and uses the following logic:

  1. There is a learning_rate parameter in every optimizer object.
  2. The optimizer object is called iteratively inside the _iterate_with_loss method of the Inference class.
  3. Therefore, we instantiate the optimizer once and share it between a callback and the Inference class.
  4. At each step, the callback updates the learning_rate of the optimizer,
  5. which is propagated to the Inference class (since it's the same object).

Unfortunately, this is not at all how the compiled pytensor function works. First, the implementation of the optimizers is deceptive. They are not classes, they are partial functions, whose only role is to return an updates dictionary.

This updates dictionary is important, because after a pytensor function is compiled, changes to variables outside the function will have no effect. Consider the following graph:

import pytensor
import pytensor.tensor as pt

x = pt.dscalar('x')
y = 1.0
z = x + y
f = pytensor.function([x], z)

print(f(4)) # 5.0
print(f(5)) # 6.0

y = 2

print(f(4)) # 5.0, not 6.0

What happened? After running pytensor.function to compile the function, the variable y in the python namespace no longer has anything to do with the computation. If we try to naively update it, nothing will happen. This is exactly the situation we are in when running pm.fit. If you check the fit method in the Inference class, the line that actually does all the computation is this innocuous one. What, then, is step_func? It's created here, and as you can see, it's a compiled pytensor function (compiled using compile_pymc rather than pytensor.function, but if you keep digging you'll see pytensor.function is called by compile_pymc). So we're in exactly the case of my code snippet above.

To illustrate what's going on, I made a dummy "optimizer" that always adds the learning rate to the parameters:

from pymc.variational.updates import _get_call_kwargs, get_or_compute_grads
from functools import partial
from collections import OrderedDict

def dummy_optimizer(loss_or_grads=None, params=None, learning_rate=1e-3):
    # Called with no arguments, return a partial that remembers the kwargs --
    # the same dispatch pattern the real optimizers in pymc/variational/updates.py use
    if loss_or_grads is None and params is None:
        return partial(dummy_optimizer, **_get_call_kwargs(locals()))
    elif loss_or_grads is None or params is None:
        raise ValueError("Please provide both `loss_or_grads` and `params` to get updates")
    # grads = get_or_compute_grads(loss_or_grads, params)
    updates = OrderedDict()

    for param in params:
        new_param = param + learning_rate
        updates[param] = new_param

    return updates

We can use this optimizer to check the effect of the current schedulers. Helper functions:

def make_regression():
    X = np.random.normal(size=(100,))
    beta = np.random.normal()
    alpha = np.random.normal()
    noise = np.random.normal(size=(100,))
    y = alpha + X * beta + noise

    return X, y

def make_model():
    X, y = make_regression()
    with pm.Model() as mod:
        b = pm.Normal('b')
        a = pm.Normal('a')
        # sigma = pm.HalfNormal('sigma')
    
        mu = a + X * b
        y_hat = pm.Normal('y_hat', mu, 1, observed=y)

    return mod

Here's a base case with no scheduling callback. I do use a tracker so we can see how the dummy works:

learning_rate = 1
optimizer = dummy_optimizer(learning_rate=learning_rate)

with make_model():
    advi = pm.ADVI()
    tracker = pm.callbacks.Tracker(mean=advi.approx.mean.eval)
    approx = advi.fit(100, callbacks=[tracker], obj_optimizer=optimizer)

Check that the parameters are deterministically updated by the learning rate at every step:

mean_history = np.r_[tracker['mean']]
np.all(np.diff(mean_history, axis=0) == learning_rate) # True

Now run it again with the ExponentialDecay schedule callback:

learning_rate = 1
optimizer = dummy_optimizer(learning_rate=learning_rate)
scheduler = pm.callbacks.ExponentialDecay(optimizer=optimizer, decay_steps=10, decay_rate=0.5)

with make_model():
    advi = pm.ADVI()
    tracker = pm.callbacks.Tracker(mean=advi.approx.mean.eval)
    approx = advi.fit(100, callbacks=[tracker, scheduler], obj_optimizer=optimizer)

Check the mean history:

mean_history = np.r_[tracker['mean']]
np.all(np.diff(mean_history, axis=0) == learning_rate) # True

As you can see, the learning rate has not changed. We can check that the scheduler updated as expected:

scheduler.optimizer.keywords['learning_rate'] == 1 * 0.5 ** 10 # True

We are in the case of the simple pytensor code snippet posted above. The learning_rate kwarg in the optimizer is updating (this variable plays the part of the y variable in my simple example). But the train has already left the station -- the function is compiled, so this variable is totally irrelevant to the computation.

So what is the solution? To interact with a compiled function, you need to use shared variables. These are a special type of pytensor object that can be adjusted between function calls. Returning to the code snippet, let's use a shared variable:

x = pt.dscalar('x')
y = pytensor.shared(1.0, name='y')
z = x + y
f = pytensor.function([x], z)

print(f(4)) # 5.0
print(f(5)) # 6.0

y.set_value(2)

print(f(4)) # 6.0

After using a shared variable, we can adjust the value of y to adjust the output of f, even though it's already compiled.

The way that shared variables are automatically updated from run to run of a function is through an update dictionary: a mapping from shared variables to the expressions that compute their new values. This is exactly the dictionary returned by the optimizers, as noted above. Let's take a look at the SGD function from here:

def sgd(loss_or_grads=None, params=None, learning_rate=1e-3):
    # (partial-dispatch boilerplate omitted)
    grads = get_or_compute_grads(loss_or_grads, params)
    updates = OrderedDict()

    for param, grad in zip(params, grads):
        updates[param] = param - learning_rate * grad

    return updates

You can see that it computes the gradients of the loss function, then computes SGD update rule for each parameter in turn. It stores these in an OrderedDict, where the keys are the old parameters, and the values are the updated parameters. It then returns the dictionary.
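To make that mechanism concrete, here is a minimal, self-contained pytensor example (not PyMC-specific) in which an update dictionary halves a shared variable after every call:

import pytensor
import pytensor.tensor as pt

x = pt.dscalar('x')
lr = pytensor.shared(1.0, name='lr')

# The updates mapping is applied after each call: lr is halved every time f runs.
f = pytensor.function([x], x * lr, updates={lr: lr * 0.5})

print(f(4)) # 4.0 (lr was 1.0 during this call)
print(f(4)) # 2.0 (lr is now 0.5)
print(f(4)) # 1.0 (lr is now 0.25)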

As a final demonstration, let's adjust the dummy optimizer to have an exponential decay on the learning rate:

def dummy_optimizer_w_decay(loss_or_grads=None, params=None, learning_rate=1e-3):
    if loss_or_grads is None and params is None:
        return partial(dummy_optimizer_w_decay, **_get_call_kwargs(locals()))
    elif loss_or_grads is None or params is None:
        raise ValueError("Please provide both `loss_or_grads` and `params` to get updates")
    # grads = get_or_compute_grads(loss_or_grads, params)
    updates = OrderedDict()
    if not isinstance(learning_rate, pt.sharedvar.SharedVariable):
        learning_rate = pytensor.shared(pm.floatX(learning_rate), name='learning_rate')

    for param in params:
        new_param = param + learning_rate
        updates[param] = new_param

    updates[learning_rate] = learning_rate * 0.9  # the decay lives in the update dictionary, on the graph
    return updates

And see how this looks in a model:

learning_rate = 1
optimizer = dummy_optimizer_w_decay(learning_rate=learning_rate)

with make_model():
    advi = pm.ADVI()
    tracker = pm.callbacks.Tracker(mean=advi.approx.mean.eval)
    approx = advi.fit(100, callbacks=[tracker], obj_optimizer=optimizer)

Here's a plot of the tracked parameters. As you can see, we now get the desired decay of the learning rate.

mean_history = np.r_[tracker['mean']]
plt.plot(mean_history)

[plot: tracked parameter means, with step sizes shrinking as the learning rate decays]

jessegrabowski commented on Dec 02 '23

So can the callback approach be saved? Unfortunately, I don't think so. By the time the callbacks are invoked, in the fit loop, the step_func function is already compiled. At this stage it's too late to adjust the function without some pretty impressive gymnastics. This is going to have to go back to the drawing board using something like the wrapper approach.

jessegrabowski commented on Dec 02 '23

I tried my hand at implementing the schedulers using the function approach I suggested on the discourse. Here's how it looks:

import numpy as np
import matplotlib.pyplot as plt
import pymc as pm
import pytensor
import pytensor.tensor as pt

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

Xt = pytensor.shared(X_train)
yt = pytensor.shared(y_train)

with pm.Model() as iris_model:
    # Coefficients for features
    β = pm.Normal("β", 0, sigma=1e2, shape=(4, 3))
    # Intercepts
    a = pm.Normal("a", sigma=1e4, shape=(3,))
    # Softmax transforms the linear predictor to probabilities on the unit interval
    p = pt.special.softmax(Xt.dot(β) + a, axis=-1)

    observed = pm.Categorical("obs", p=p, observed=yt)

opt = pm.adamax(learning_rate=1)
scheduled_optimizer = pm.updates.reduce_lr_on_plateau_scheduler(opt, factor=0.25, cooldown=1000, patience=3000, min_lr=1e-12)
# scheduled_optimizer = pm.updates.exponential_decay_scheduler(opt, decay_steps=1, decay_rate=0.9999, staircase=False, min_lr=1e-8)

with iris_model:
    inference = pm.ADVI()

    updates = inference.objective.updates(obj_optimizer=scheduled_optimizer)
    inputs = list(updates.keys())
    outputs = list(updates.values())

    old_names = [x.name for x in inputs]
    new_names = [x.name for x in outputs]

    expected_names = ['best_loss', 'cooldown_counter', 'wait', 'learning_rate']
    outputs_of_interest = [
        outputs[new_names.index(f'{name}__updated')] for name in expected_names
    ]
    tracker = pm.callbacks.Tracker(
        best_loss=outputs_of_interest[0].eval,
        cooldown_counter=outputs_of_interest[1].eval,
        wait=outputs_of_interest[2].eval,
        learning_rate=outputs_of_interest[3].eval,
    )
    convergence_trackers = [
        pm.callbacks.CheckParametersConvergence(every=100, diff='relative'),
        pm.callbacks.CheckParametersConvergence(every=100, diff='absolute'),
    ]
    approx = inference.fit(1_000_000, callbacks=[tracker] + convergence_trackers, obj_optimizer=scheduled_optimizer)

[plot: tracked best_loss and learning_rate over the course of fitting]

There are still bugs, and I don't fully understand the VI codebase. I refactored a lot of stuff in variational/updates.py to make it possible to pass the loss function around.

The bugs are that the schedulers assume the loss function returned by the Approximations is always a scalar, which isn't true. I wasn't able to figure out what the extra graphs returned by e.g. FullRankADVI and SVGD are, so these work with ADVI only for now. Also, tracking cooldown_counter and wait doesn't work, but they are clearly doing something internally, because the learning rate updates according to them. I'm not sure if there's a better way to expose these new variables for tracking.

Consider it a suggestion; if you come up with a better approach, feel free to revert these commits. I was thinking your approach might work combined with some of the refactoring I did? You would need to intercept the optimizers and make the learning rate a shared variable (like I do here), then use learning_rate.set_value from your callback objects to push the updates (that you compute off the graph) into the compiled function. My approach is totally on the graph -- I have no intuition about which is better. The one thing I can say is that I think my approach should be 100% composable, so we could implement a helper function like pm.schedule_learning_rate or whatever that takes a list of schedulers and applies them in sequence.
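For what it's worth, a minimal sketch of that hybrid idea might look like the following. It assumes the optimizer's learning_rate has been replaced by a shared variable that actually enters the compiled graph (the optimizers are ported from lasagne, where passing a symbolic learning rate works); the callback itself is hypothetical:

import pymc as pm
import pytensor

lr = pytensor.shared(0.1, name="learning_rate")
optimizer = pm.adam(learning_rate=lr)  # assumption: a shared variable is accepted here

class HalveLREvery:
    """Hypothetical callback: halve the shared learning rate every `every` iterations."""

    def __init__(self, lr_shared, every=1000):
        self.lr_shared = lr_shared
        self.every = every

    def __call__(self, approx, loss_hist, i):
        if i % self.every == 0:
            # set_value changes what the compiled step function sees on its next call
            self.lr_shared.set_value(self.lr_shared.get_value() / 2.0)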

jessegrabowski commented on Dec 02 '23

D'oh, good spot!

alvaropp commented on Dec 04 '23

It seems like a good idea to use schedulers universally, with a Constant scheduler as the default. This may be a cleaner API and allow us to easily infer the optimizer, etc.
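A trivial sketch of what that default might look like, assuming schedulers wrap the partial-function optimizers described above (the class name and API are hypothetical):

class ConstantScheduler:
    """Hypothetical no-op default: passes the optimizer's updates through unchanged."""

    def __init__(self, optimizer):
        self.optimizer = optimizer

    def __call__(self, loss_or_grads, params):
        # A real scheduler would add its own entries to this updates dictionary,
        # e.g. a decaying learning rate, as in the on-graph approach above.
        return self.optimizer(loss_or_grads, params)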

fonnesbeck commented on Jan 05 '24

My proposed changes follow the pytorch API, which looks like this:

import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min')

So the user passes the optimizer to the scheduler explicitly in code. It's also written so that users can compose schedulers by chaining them together, as sketched below.
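For reference, chained schedulers in torch look roughly like this (model, train_one_epoch, and n_epochs are placeholders, not defined here):

import torch
from torch.optim.lr_scheduler import ExponentialLR, ReduceLROnPlateau

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
decay = ExponentialLR(optimizer, gamma=0.99)
plateau = ReduceLROnPlateau(optimizer, 'min')

for epoch in range(n_epochs):
    val_loss = train_one_epoch(model, optimizer)  # placeholder training step
    decay.step()            # unconditional exponential decay
    plateau.step(val_loss)  # further reduction if the loss plateaus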

From a clean-code perspective I like the idea of a "ConstantOptimizer" base class from which the others inherit, but I don't think it makes sense the way I have it here. If the scheduler were added as a keyword to pm.ADVI I would agree, but then we lose the ability to compose (plus we already have too many keywords).

jessegrabowski avatar Jan 05 '24 22:01 jessegrabowski