# RFC: Easily Customizable Optimizer.minimize
| Status | In Revision |
|---|---|
| RFC # | 234 |
| Author(s) | omalleyt12@ |
| Sponsor | apassos@, fchollet@, karmel@ |
| Updated | 2020-04-20 |
## Objective
Create an Optimizer API that gives Optimizer subclasses full control of gradient updates. The API should ensure Optimizers can be used through a unified interface that does not leak abstractions. Training loops should not be required to know the internal details of how the Optimizer chooses to:
- Scale losses and gradients
- Aggregate gradients
- Clip gradients
- etc.
We also need to maintain endpoints with maximum flexibility for those users who do want control over these items.
This API will enable users to write training loops that interoperate with a wide range of Optimizers.
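For illustration only, here is a minimal sketch of the hook-based minimize structure this objective implies. Apart from transform_unaggregated_gradients, which is mentioned later in this thread, the hook names and the plain-SGD apply step are placeholders rather than the proposed API:

```python
class SimpleOptimizer:
  """Illustrative-only sketch of a hook-based minimize.

  Wrappers (clipping, mixed precision, Horovod, ...) override only the
  hooks they need; the training loop only ever calls `minimize`.
  """

  def __init__(self, learning_rate=0.01):
    self.learning_rate = learning_rate

  def minimize(self, loss, variables, tape):
    loss = self.transform_loss(loss)                      # e.g. loss scaling
    grads = tape.gradient(loss, variables)
    grads = self.transform_unaggregated_gradients(grads)  # per-replica transforms
    grads = self.aggregate_gradients(grads)               # e.g. cross-replica sum
    grads = self.transform_gradients(grads)               # e.g. gradient clipping
    for grad, var in zip(grads, variables):
      var.assign_sub(self.learning_rate * grad)           # placeholder SGD apply

  # Default hooks are pass-throughs.
  def transform_loss(self, loss):
    return loss

  def transform_unaggregated_gradients(self, grads):
    return grads

  def aggregate_gradients(self, grads):
    return grads

  def transform_gradients(self, grads):
    return grads
```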
Specific use cases considered:
- Gradient clipping
- Mixed precision
- Horovod
In TensorFlow Addons, we have multiple wrappers/hooks that need to work with arbitrary optimizers, so we need to make sure the API covers those use cases:
- @Squadrick could you take a look and ensure that the Stochastic Weight Averaging and the moving average wrapper can be implemented with this new API?
- @cyberzhg could you take a look and ensure that the lookahead mechanism can be implemented with the new API?
- @philjd could you take a look and ensure the decoupled weight decay mechanism can be implemented with this new API?
Thank you in advance. After this RFC is merged and implemented, we should migrate those 4 wrappers to the new API.
Link to the nicely formatted document: https://github.com/omalleyt12/community/blob/patch-1/rfcs/2020-04-20-Optimizer-minimize.md
Should we create a utility class to help with wrapping an Optimizer? I.e. OptimizerWrapper?
If possible, could you all write some pseudo code of what each wrapper implementation would look like with this new API? That would allow us to answer this last question.
Please send out an email to [email protected] with the announcement of this pull request and give at least a two week review period per: https://github.com/tensorflow/community/blob/master/governance/TF-RFCs.md
Thanks for the ping @gabrieldemarmiesse! Unfortunately, I've never used Keras, so I'm missing a lot of context and I might overlook some solution to this. I'd be super happy about comments on the proposals below :)
Our API use case: In general, the idea of decoupled weight decay is to decouple the decaying variable update from the optimizer variable update. While weight decay is equal to an L2 penalty on the weights for some optimizers, this is not true for adaptive optimizers such as Adam. So the more general solution would be something like (pseudo code):

```python
# Pseudo code: the decay update is decoupled from the optimizer's gradient update.
optimizer.apply_gradients(grads_and_vars)
for var in variables:
    var.assign_sub(weight_decay * var)
```
The challenge here is that the set of variables to be decayed might not be the same as the set of trainable variables (e.g., we might want to exclude batch norm parameters). For maximum flexibility, we've been wrapping the optimizer's minimize method and allowing users to pass a list of the variables to decay, as minimize(loss, var_list, ..., decay_var_list=None).
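For concreteness, a hedged sketch of how that existing TFA-style signature is used with tensorflow_addons' AdamW today (the model and the batch-norm name filter are purely illustrative):

```python
import tensorflow as tf
import tensorflow_addons as tfa

model = tf.keras.Sequential([tf.keras.layers.Dense(10),
                             tf.keras.layers.BatchNormalization()])
model.build((None, 4))
optimizer = tfa.optimizers.AdamW(weight_decay=1e-4, learning_rate=1e-3)

# Decay everything except the batch-norm parameters.
decay_vars = [v for v in model.trainable_variables
              if "batch_normalization" not in v.name]

x = tf.random.normal((8, 4))
loss_fn = lambda: tf.reduce_mean(model(x, training=True) ** 2)
optimizer.minimize(loss_fn, model.trainable_variables,
                   decay_var_list=decay_vars)
```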
Possible solutions:
- Make the decay_var_list a member variable of the optimizer. This might be annoying if the variables to minimize differ between train steps.
- Create a second optimizer that only decays variables and chain the optimizers (I don't know whether chaining optimizers in Keras is possible).
- Always decay all variables. To avoid decaying some variables, one would need to create separate optimizers for those. Again, I don't know whether using two or more different optimizers is possible in Keras (especially with model.fit).
Thanks for looping these people in @gabrieldemarmiesse!
@PhilJd Thanks for the comments!
Re weight decay: In general you can still wrap Optimizer.minimize to perform this:
```python
class WeightDecayOptimizer(Optimizer):

  def minimize(self, loss, variables, tape, decay_variables=None):
    if decay_variables is None:
      decay_variables = variables
    # Decoupled decay: shrink each selected variable, then do the regular update.
    for var in decay_variables:
      var.assign_sub(self.weight_decay * var)
    super(WeightDecayOptimizer, self).minimize(loss, variables, tape)
```
The decay_variables argument would not be passed in the default Model.fit, so only the use case where the variables and decay_variables are the same would be supported. This is the case for Model.fit today though, so that wouldn't be a regression. Users could also override Model.train_step (which allows users to provide their own custom training step logic) to support this use case if they want.
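For example, a hedged sketch of what such a Model.train_step override could look like, building on the WeightDecayOptimizer sketch above (the bias-exclusion rule is illustrative):

```python
import tensorflow as tf

class WeightDecayModel(tf.keras.Model):

  def train_step(self, data):
    x, y = data
    with tf.GradientTape() as tape:
      y_pred = self(x, training=True)
      loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)
    # Illustrative rule: decay everything except bias variables.
    decay_vars = [v for v in self.trainable_variables if "bias" not in v.name]
    # `minimize(loss, variables, tape, decay_variables=...)` is the
    # hypothetical signature of the WeightDecayOptimizer sketched above.
    self.optimizer.minimize(loss, self.trainable_variables, tape,
                            decay_variables=decay_vars)
    self.compiled_metrics.update_state(y, y_pred)
    return {m.name: m.result() for m in self.metrics}
```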
> Make the decay_var_list a member variable of the optimizer. This might be annoying if the variables to minimize differ between train steps.
This would work, although it's worth noting as well that oftentimes, Variables are created when the Model is first called on data, so for most workflows the Model Variables aren't guaranteed to exist when the Optimizer is created.
> Create a second optimizer that only decays variables and chain the optimizers (I don't know whether chaining optimizers in Keras is possible).
This wouldn't work with the default Model.fit logic, whereas the decay_variables argument pattern at least works as long as the decay_variables and variables are the same.
In general, there's probably a question of whether Model.fit should have a separate configuration for how to decay variables, since weight decay seems orthogonal to the Optimizer (in that, IIUC, it never needs to know about the gradients).
/CC @DEKHTIARJonathan, @tgaddair. This API will be relevant to Horovod's distributed optimizer.
Thanks @reedwm, much appreciated to be informed :)
I actually truly like the idea of this RFC. I have defended many points of this RFC's design in this PR: https://github.com/tensorflow/tensorflow/issues/36398.
About:
> Should we create a utility class to help with wrapping an Optimizer? I.e. OptimizerWrapper? @omalleyt12
I would be in favor of this. Having OptimizerWrapper would greatly improve the maintainability of AMP and Horovod and allow us to provide "a standard" way to plug into TensorFlow for anyone thinking of doing so. On the core API side, it also simplifies API support: you can state that the public API of the OptimizerWrapper class is the only one supported long term, and interface with it behind the scenes in whatever way is needed, completely abstracted from the Horovod or AMP Optimizers.
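To make the idea concrete, here is a hypothetical sketch of what such an OptimizerWrapper base class could look like (nothing below is an existing API):

```python
import tensorflow as tf

class OptimizerWrapper(tf.keras.optimizers.Optimizer):
  """Hypothetical base class that delegates to an inner optimizer, so that
  wrappers such as AMP or Horovod override only the steps they care about."""

  def __init__(self, inner_optimizer, name="OptimizerWrapper", **kwargs):
    super(OptimizerWrapper, self).__init__(name, **kwargs)
    self.inner_optimizer = inner_optimizer

  def apply_gradients(self, grads_and_vars, **kwargs):
    # A concrete wrapper would transform `grads_and_vars` here (clip, unscale,
    # allreduce, ...) before delegating to the wrapped optimizer.
    return self.inner_optimizer.apply_gradients(grads_and_vars, **kwargs)

  def get_config(self):
    config = super(OptimizerWrapper, self).get_config()
    config["inner_optimizer"] = tf.keras.optimizers.serialize(
        self.inner_optimizer)
    return config
```

A gradient-clipping wrapper, for example, would then override only apply_gradients (or the finer-grained hooks proposed in the RFC) and leave everything else to the inner optimizer.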
Overall, I'm really enthusiastic about this RFC. Sounds like a great improvement :+1:
@hyang0129 could you take a look at this RFC to see if it would make https://github.com/tensorflow/addons/pull/969 easier to implement? Maybe add some comments if you have ideas.
@Squadrick @CyberZHG friendly ping to see if this RFC can make the implementation of Stochastic Weight Averaging, the moving average wrapper, and the lookahead mechanism simpler in TensorFlow Addons.
@omalleyt12 gradient transformations may have unintended consequences when paired with moment based optimizers such as Adam. These optimizers usually take the first and second moment of the gradient to compute adaptive learning rates for each parameter. Scaling a gradient by a factor X will not scale the parameter update by a factor of X, because the moments of the gradient do not scale linearly with the gradient.
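A tiny numeric illustration of this point (not from the RFC), using a single simplified Adam-style step:

```python
import numpy as np

def adam_like_step(g, beta1=0.9, beta2=0.999, eps=1e-8):
  # One bias-corrected Adam-style step: m_hat / (sqrt(v_hat) + eps).
  m_hat = ((1 - beta1) * g) / (1 - beta1)       # first moment, bias-corrected
  v_hat = ((1 - beta2) * g ** 2) / (1 - beta2)  # second moment, bias-corrected
  return m_hat / (np.sqrt(v_hat) + eps)

print(adam_like_step(0.01))    # ~1.0
print(adam_like_step(1000.0))  # still ~1.0: scaling g by 1e5 barely changes the step
```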
@hyang0129 Agreed, the loss scaling optimizer example doesn't actually scale the applied gradients. Instead, it temporarily scales up the loss so that gradients are computed in a numerically stable way, and then unscales them before applying Variable updates.
For other, user-written gradient transformations such as gradient clipping, ensuring that the math is doing what they want is up to them
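For reference, a hedged sketch of that pattern with the existing (TF 2.2-era) experimental LossScaleOptimizer in a custom training step:

```python
import tensorflow as tf

var = tf.Variable(1.0)
opt = tf.keras.mixed_precision.experimental.LossScaleOptimizer(
    tf.keras.optimizers.SGD(0.01), loss_scale="dynamic")

with tf.GradientTape() as tape:
  loss = var ** 2
  scaled_loss = opt.get_scaled_loss(loss)          # scale the loss up
scaled_grads = tape.gradient(scaled_loss, [var])
grads = opt.get_unscaled_gradients(scaled_grads)   # unscale before applying
opt.apply_gradients(zip(grads, [var]))             # update is mathematically unchanged
```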
@gabrieldemarmiesse Sorry about the delay. Moving average_wrapper over to the new API would be fairly easy; we only need to override transform_unaggregated_gradients.
Once average_wrapper is migrated, migrating the moving_average and stochastic_weight_average optimizers would be very simple.
@omalleyt12 Please post the notes from the design review here.
@omalleyt12 A gentle ping so we can finish this up.
@omalleyt12 told me that this design will likely need another round of review at some point as there were differing opinions.
Hello, do you have any information about when optimizers will be allowed to perform global gradient clipping across several GPUs? Some of my work quite critically depends on this. Thank you!
https://github.com/tensorflow/tensorflow/commit/53576063848800b6ee906c6e94d840fbdbbecd1b
@adriendoerig if you use Horovod, that is already quite easy to do :)
Just wondering, why is clipping by global norm also not easily supported under a distribution strategy? It would be really helpful, as it keeps the gradient direction unchanged.
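Until the RFC lands, one hedged sketch of a workaround in a custom training loop under tf.distribute (TF 2.2-era APIs; scaling the loss by the global batch size is omitted for brevity): aggregate the gradients manually, clip by global norm, then tell apply_gradients not to aggregate again.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
  model.build((None, 4))
  optimizer = tf.keras.optimizers.SGD(0.1)

@tf.function
def train_step(dist_inputs):
  def step_fn(x, y):
    with tf.GradientTape() as tape:
      loss = tf.reduce_mean(tf.square(y - model(x, training=True)))
    grads = tape.gradient(loss, model.trainable_variables)
    # Sum the per-replica gradients first so the global norm is computed
    # over the fully aggregated gradients...
    ctx = tf.distribute.get_replica_context()
    grads = ctx.all_reduce(tf.distribute.ReduceOp.SUM, grads)
    grads, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)
    # ...then skip the optimizer's own aggregation.
    optimizer.apply_gradients(zip(grads, model.trainable_variables),
                              experimental_aggregate_gradients=False)
  strategy.run(step_fn, args=dist_inputs)
```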
@omalleyt12, @fchollet Are you still pursuing this or should this be closed out?