Added ReduceLROnPlateau callback for VI.
This has been discussed in the forums and in this issue.
There are strong parallels between PyMC's variational inference and training neural networks: in both cases, the choice of optimiser and learning rate has a huge impact on training quality and speed. A common technique for training neural networks is to use a learning rate scheduler, which reduces the learning rate on a schedule: you start high for fast initial convergence and reduce it in successive epochs, where you want updates to be more precise.
Currently, in PyMC, you need to specify a suitable learning rate up front: too large and the fit won't converge, too small and it will be too slow. Training once with a large-ish learning rate and then using the result of that round as the starting point for another round with a smaller learning rate is neither trivial nor elegant.
To address this issue, I've implemented a new callback for pymc.variational, following Keras' ReduceLROnPlateau. It monitors the model loss at every iteration of VI.fit() and reduces the learning rate by a user-specified amount if the loss doesn't improve after a user-specified number of iterations.
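For intuition, the plateau logic looks roughly like this (a simplified sketch only; the factor and patience names follow Keras and are assumptions, not necessarily this PR's exact signature):

import numpy as np

# Simplified, hypothetical sketch of plateau detection -- not the PR's actual code.
class ReduceLROnPlateauSketch:
    def __init__(self, factor=0.1, patience=10):
        self.factor = factor      # multiply the learning rate by this on a plateau
        self.patience = patience  # iterations without improvement before reducing
        self.best_loss = np.inf
        self.wait = 0

    def __call__(self, approx, hist, i):
        # PyMC VI callbacks are invoked as callback(approximation, loss_history, iteration)
        if hist[-1] < self.best_loss:
            self.best_loss = hist[-1]
            self.wait = 0
        elif self.wait >= self.patience:
            self.wait = 0
            # ...reduce the optimiser's learning rate by `factor` here...
        else:
            self.wait += 1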
Major / Breaking Changes
None (I think!)
New features
Added a ReduceLROnPlateau callback.
Bugfixes
None.
Documentation
None so far, but it would be nice to add an example.
Maintenance
None.
:books: Documentation preview :books:: https://pymc--7011.org.readthedocs.build/en/7011/
:sparkling_heart: Thanks for opening this pull request! :sparkling_heart: The PyMC community really appreciates your time and effort to contribute to the project. Please make sure you have read our Contributing Guidelines and filled in our pull request template to the best of your ability.
@jessegrabowski raised a good point that it's not ideal to require the user to define a shared variable, so I've reworked my callback so that it takes a vanilla PyMC optimiser instance and modifies its learning rate directly:
with pm.Model() as model:
    [...]

    optimiser = pm.adam(learning_rate=0.1)
    fit = pm.fit(
        obj_optimizer=optimiser,
        callbacks=[
            ReduceLROnPlateau(optimiser=optimiser),
        ],
    )
Codecov Report
Merging #7011 (d375bf0) into main (ad450a6) will increase coverage by 1.46%. Report is 20 commits behind head on main. The diff coverage is 80.68%.
Additional details and impacted files
@@ Coverage Diff @@
## main #7011 +/- ##
==========================================
+ Coverage 87.66% 89.12% +1.46%
==========================================
Files 101 101
Lines 16964 16982 +18
==========================================
+ Hits 14871 15136 +265
+ Misses 2093 1846 -247
| Files | Coverage Δ | |
|---|---|---|
| pymc/variational/opvi.py | 74.17% <100.00%> (-13.22%) | :arrow_down: |
| pymc/variational/updates.py | 91.74% <98.86%> (-0.37%) | :arrow_down: |
| pymc/variational/callbacks.py | 55.55% <23.21%> (-40.68%) | :arrow_down: |
Is it possible to add an example somewhere? Maybe here: https://www.pymc.io/projects/examples/en/latest/gallery.html#variational-inference
I still prefer the Torch-style API to the callback API, but this PR is great so I'll let it go. The tests need to be a bit more robust, though. I also want to keep advocating for a slightly more general solution that will allow for 1) different learning rate adjustment strategies, and 2) composition of learning rate strategies.
I'm not saying this PR needs to implement every single learning rate scheduler, but it would be nice to have a LearningRateStrategy base class that could be extended in the future, then subclass ReduceLROnPlateau from that.
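Something like this hypothetical hierarchy (illustrative names only, not existing PyMC API):

from pymc.variational.callbacks import Callback

# Hypothetical sketch of the suggested hierarchy -- none of this is PyMC API.
class LearningRateStrategy(Callback):
    """Base class for callbacks that adjust an optimizer's learning rate."""

    def __init__(self, optimiser):
        self.optimiser = optimiser

    def adjust(self, loss_hist, i):
        # Subclasses return the new learning rate for iteration i
        raise NotImplementedError

    def __call__(self, approx, loss_hist, i):
        new_lr = self.adjust(loss_hist, i)
        # ...apply new_lr to self.optimiser...

class ReduceLROnPlateau(LearningRateStrategy):
    def adjust(self, loss_hist, i):
        ...  # plateau detection, as in this PR

class ExponentialDecay(LearningRateStrategy):
    def adjust(self, loss_hist, i):
        ...  # e.g. lr * decay_rate ** (i / decay_steps)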
I've implemented an ExponentialDecay callback and refactored things a bit.
Hi @jessegrabowski, any concrete ideas for taking this PR forward?
As you said, it would be nice to infer the optimiser from the code rather than pass it to the callback, but I don't know how to go about that.
with pm.Model() as model:
    [...]

    optimiser = pm.adam(learning_rate=0.1)
    fit = pm.fit(
        obj_optimizer=optimiser,
        callbacks=[
            ReduceLROnPlateau(optimiser=optimiser),
        ],
    )
@alvaropp sorry for letting this get stale, I am going to do a careful review now/this weekend so you can get back on track and merged in ASAP
I'm going over the code carefully, and I don't think the callbacks as written do anything. There is a misunderstanding about how compiled pytensor functions work.
The implementation is very nice, and uses the following logic:
- There is a learning_rate parameter in every optimizer object.
- The optimizer object is called iteratively inside the _iterate_with_loss method of the Inference class.
- Therefore, we instantiate the optimizer class once and share it between a callback and the Inference class.
- At each step, the callback updates the learning_rate of the optimizer...
- ...which is propagated to the Inference class (since it's the same object).
Unfortunately, this is not at all how the compiled pytensor function works. First, the implementation of the optimizers is deceptive: they are not classes, they are partial functions whose only role is to return an updates dictionary.
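For example, calling an optimizer with only keyword arguments just curries them (assuming the pattern in variational/updates.py, this is easy to verify):

from functools import partial
import pymc as pm

opt = pm.adam(learning_rate=0.1)
print(isinstance(opt, partial))       # True -- a curried function, not a class instance
print(opt.keywords['learning_rate'])  # 0.1

Only when it is finally called with loss_or_grads and params does it build and return the updates dictionary.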
This updates dictionary is important, because after a pytensor function is compiled, changes to variables outside the function will have no effect. Consider the following graph:
import pytensor
import pytensor.tensor as pt

x = pt.dscalar('x')
y = 1.0
z = x + y
f = pytensor.function([x], z)

print(f(4))  # 5.0
print(f(5))  # 6.0

y = 2
print(f(4))  # still 5.0 -- rebinding y has no effect on the compiled graph
What happened? After running pytensor.function to compile the function, the variable y in the python namespace no longer has anything to do with the computation. If we try to naively update it, nothing will happen. This is exactly the situation we are in when running pm.fit. If you check the fit method in the Inference class, the line that actually does all the computation is this innocuous one. What, then, is step_func? It's created here, and as you can see, it's a compiled pytensor function (compiled using compile_pymc rather than pytensor.function, but if you keep digging you'll see pytensor.function is called by compile_pymc). So we're in exactly the case of my code snippet above.
To illustrate what's going on, I made a dummy "optimizer" that always adds the learning rate to the parameters:
from collections import OrderedDict
from functools import partial

from pymc.variational.updates import _get_call_kwargs, get_or_compute_grads


def dummy_optimizer(loss_or_grads=None, params=None, learning_rate=1e-3):
    if loss_or_grads is None and params is None:
        return partial(dummy_optimizer, **_get_call_kwargs(locals()))
    elif loss_or_grads is None or params is None:
        raise ValueError("Please provide both `loss_or_grads` and `params` to get updates")

    # grads = get_or_compute_grads(loss_or_grads, params)
    updates = OrderedDict()
    for param in params:
        new_param = param + learning_rate
        updates[param] = new_param
    return updates
We can use this optimizer to check the effect of the current schedulers. Helper functions:
def make_regression():
    X = np.random.normal(size=(100,))
    beta = np.random.normal()
    alpha = np.random.normal()
    noise = np.random.normal(size=(100,))
    y = alpha + X * beta + noise
    return X, y


def make_model():
    X, y = make_regression()
    with pm.Model() as mod:
        b = pm.Normal('b')
        a = pm.Normal('a')
        # sigma = pm.HalfNormal('sigma')
        mu = a + X * b
        y_hat = pm.Normal('y_hat', mu, 1, observed=y)
    return mod
Here's a base case with no scheduling callback. I do use a tracker so we can see how the dummy works:
learning_rate = 1
optimizer = dummy_optimizer(learning_rate=learning_rate)

with make_model():
    advi = pm.ADVI()
    tracker = pm.callbacks.Tracker(mean=advi.approx.mean.eval)
    approx = advi.fit(100, callbacks=[tracker], obj_optimizer=optimizer)
Check that the parameters are deterministically updated by the learning rate at every step:
mean_history = np.r_[tracker['mean']]
np.all(np.diff(mean_history, axis=0) == learning_rate) # True
Now run it again with the ExponentialDecay schedule callback:
learning_rate = 1
optimizer = dummy_optimizer(learning_rate=learning_rate)
scheduler = pm.callbacks.ExponentialDecay(optimizer=optimizer, decay_steps=10, decay_rate=0.5)

with make_model():
    advi = pm.ADVI()
    tracker = pm.callbacks.Tracker(mean=advi.approx.mean.eval)
    approx = advi.fit(100, callbacks=[tracker, scheduler], obj_optimizer=optimizer)
Check the mean history:
mean_history = np.r_[tracker['mean']]
np.all(np.diff(mean_history, axis=0) == learning_rate) # True
As you can see, the learning rate has not changed. We can check that the scheduler updated as expected:
scheduler.optimizer.keywords['learning_rate'] == 1 * 0.5 ** 10 # True
We are in the case of the simple pytensor code snippet posted above. The learning_rate kwarg in the optimizer is updating (this variable plays the part of the y variable in my simple example), but the train has already left the station: the function is compiled, so this variable is totally irrelevant to the computation.
So what is the solution? To interact with a compiled function, you need to use shared variables. These are a special type of pytensor object that can be adjusted between function calls. Returning to the code snippet, let's use a shared variable:
x = pt.dscalar('x')
y = pytensor.shared(1.0, name='y')
z = x + y
f = pytensor.function([x], z)
print(f(4)) # 5.0
print(f(5)) # 6.0
y.set_value(2)
print(f(4)) # 6.0
After using a shared variable, we can adjust the value of y to adjust the output of f, even though it's already compiled.
The way that shared variables can be automatically updated from run-to-run of a function is through the use of an update dictionary. An update dictionary is a mapping from old shared values to new shared values. This is exactly the dictionary that is returned by the optimizer, as noted above. Let's take a look at the SGD function from here:
grads = get_or_compute_grads(loss_or_grads, params)
updates = OrderedDict()

for param, grad in zip(params, grads):
    updates[param] = param - learning_rate * grad

return updates
You can see that it computes the gradients of the loss function, then computes the SGD update rule for each parameter in turn. It stores these in an OrderedDict, where the keys are the old parameters and the values are the updated parameters, then returns the dictionary.
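To make the mechanics concrete, here is a minimal standalone example (not from the PR) of an update dictionary mutating a shared variable on every call:

import pytensor

# The update rule for the shared counter is baked into the compiled function
count = pytensor.shared(0, name='count')
f = pytensor.function([], count, updates={count: count + 1})

print(f())                # 0 -- outputs are computed before updates are applied
print(f())                # 1
print(count.get_value())  # 2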
As a final demonstration, let's adjust the dummy optimizer to have an exponential decay on the learning rate:
def dummy_optimizer_w_decay(loss_or_grads=None, params=None, learning_rate=1e-3):
    if loss_or_grads is None and params is None:
        return partial(dummy_optimizer_w_decay, **_get_call_kwargs(locals()))
    elif loss_or_grads is None or params is None:
        raise ValueError("Please provide both `loss_or_grads` and `params` to get updates")

    # grads = get_or_compute_grads(loss_or_grads, params)
    updates = OrderedDict()

    # Wrap the learning rate in a shared variable so it lives on the graph
    if not isinstance(learning_rate, pt.sharedvar.SharedVariable):
        learning_rate = pytensor.shared(pm.floatX(learning_rate), name='learning_rate')

    for param in params:
        new_param = param + learning_rate
        updates[param] = new_param

    # Decay the learning rate itself via the update dictionary
    updates[learning_rate] = learning_rate * 0.9
    return updates
And see how this looks in a model:
learning_rate = 1
optimizer = dummy_optimizer_w_decay(learning_rate=learning_rate)

with make_model():
    advi = pm.ADVI()
    tracker = pm.callbacks.Tracker(mean=advi.approx.mean.eval)
    approx = advi.fit(100, callbacks=[tracker], obj_optimizer=optimizer)
Here's a plot of the tracked parameters. As you can see, we now get the desired updating in the learning rate.
mean_history = np.r_[tracker['mean']]
plt.plot(mean_history)
So can the callback approach be saved? Unfortunately, I don't think so. By the time the callbacks are invoked, in the fit loop, the step_func function is already compiled. At this stage it's too late to adjust the function without some pretty impressive gymnastics. This is going to have to go back to the drawing board using something like the wrapper approach.
I tried my hand at implementing the schedulers using the function approach I suggested on the discourse. Here's how it looks:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import pytensor
import pytensor.tensor as pt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

Xt = pytensor.shared(X_train)
yt = pytensor.shared(y_train)
with pm.Model() as iris_model:
    # Coefficients for features
    β = pm.Normal("β", 0, sigma=1e2, shape=(4, 3))
    # Transform to unit interval
    a = pm.Normal("a", sigma=1e4, shape=(3,))
    p = pt.special.softmax(Xt.dot(β) + a, axis=-1)
    observed = pm.Categorical("obs", p=p, observed=yt)

opt = pm.adamax(learning_rate=1)
scheduled_optimizer = pm.updates.reduce_lr_on_plateau_scheduler(
    opt, factor=0.25, cooldown=1000, patience=3000, min_lr=1e-12
)
# scheduled_optimizer = pm.updates.exponential_decay_scheduler(
#     opt, decay_steps=1, decay_rate=0.9999, staircase=False, min_lr=1e-8
# )
with iris_model:
    inference = pm.ADVI()
    updates = inference.objective.updates(obj_optimizer=scheduled_optimizer)
    inputs = list(updates.keys())
    outputs = list(updates.values())

    old_names = [x.name for x in inputs]
    new_names = [x.name for x in outputs]
    expected_names = ['best_loss', 'cooldown_counter', 'wait', 'learning_rate']
    outputs_of_interest = [
        outputs[new_names.index(f'{expected_name}__updated')]
        for expected_name in expected_names
    ]

    tracker = pm.callbacks.Tracker(
        best_loss=outputs_of_interest[0].eval,
        learning_rate=outputs_of_interest[3].eval,
        wait=outputs_of_interest[2].eval,
        cooldown_counter=outputs_of_interest[1].eval,
    )
    convergence_trackers = [
        pm.callbacks.CheckParametersConvergence(every=100, diff='relative'),
        pm.callbacks.CheckParametersConvergence(every=100, diff='absolute'),
    ]
    approx = inference.fit(
        1_000_000,
        callbacks=[tracker] + convergence_trackers,
        obj_optimizer=scheduled_optimizer,
    )
There are still bugs, and I don't fully understand the VI codebase. I refactored a lot of stuff in variational/updates.py to make it possible to pass the loss function around.
One bug is that the schedulers assume the loss function returned by the Approximations is always a scalar, which isn't true. I wasn't able to figure out what the extra graphs returned by e.g. FullRankADVI and SVGD are, so these work with ADVI only for now. Also, tracking cooldown_counter and wait doesn't work, but they are clearly doing something internally, because the learning rate is updating according to them. I'm not sure if there's a better way to expose these new variables to tracking.
Consider it a suggestion; if you come up with a better approach, feel free to revert these commits. I was thinking your approach might work combined with some of the refactoring I did? You would need to intercept the optimizers and make the learning rate a shared variable (like I do here), then use learning_rate.set_value from your callback objects to get the updates (which you compute off the graph) into the graph. My approach is totally on the graph; I have no intuition about which is better. The one thing I can say is that I think my approach should be 100% composable, so we could implement a helper function like pm.schedule_learning_rate or whatever that takes a list of schedulers and applies them in sequence.
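As a purely hypothetical illustration of that composition idea (schedule_learning_rate does not exist; the scheduler signatures are the ones from my snippet above):

from functools import partial

# Hypothetical helper -- not real PyMC API; just illustrates sequencing schedulers.
def schedule_learning_rate(optimizer, schedulers):
    for scheduler in schedulers:
        optimizer = scheduler(optimizer)
    return optimizer

scheduled_optimizer = schedule_learning_rate(
    pm.adamax(learning_rate=1),
    [
        partial(pm.updates.exponential_decay_scheduler, decay_steps=1, decay_rate=0.999),
        partial(pm.updates.reduce_lr_on_plateau_scheduler, factor=0.25, patience=3000),
    ],
)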
D'oh, good spot!
It seems like a good idea to use schedulers universally, with a Constant scheduler as the default. This may be a cleaner API and allow us to easily infer the optimizer, etc.
My proposed changes follow the PyTorch API, which looks like this:
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min')
So the user has to pass the optimizer to the scheduler explicitly in the code. It's also written so that users can compose schedulers by chaining them together.
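For reference, chained schedulers in PyTorch look like this (standard PyTorch usage, not code from this PR; model and train are placeholders):

import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler1 = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
scheduler2 = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 80], gamma=0.1)

for epoch in range(100):
    train(optimizer)  # placeholder training step
    scheduler1.step()
    scheduler2.step()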
From a clean-code perspective I like the idea of a "ConstantOptimizer" base class from which the others inherit, but I don't think it makes sense the way I have it here. If the scheduler were added as a keyword to pm.ADVI I would agree, but then we'd lose the ability to compose (plus we already have too many keywords).