# RFC: Easily Customizable Optimizer.minimize
| Status | In Revision |
|---|---|
| RFC # | 234 |
| Author(s) | omalleyt12@ |
| Sponsor | apassos@, fchollet@, karmel@ |
| Updated | 2020-04-20 |
## Objective
Create an Optimizer API that gives Optimizer subclasses full control of gradient updates. The API should ensure Optimizers can be used through a unified interface that does not leak abstractions. Training loops should not be required to know the internal details of how the Optimizer chooses to:
- Scale losses and gradients
- Aggregate gradients
- Clip gradients
- etc.
We also need to maintain endpoints with maximum flexibility for those users who do want control over these items.
This API will enable users to write training loops that interoperate with a wide range of Optimizers.
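For illustration only, here is a minimal sketch of the hook-based minimize structure this objective implies. Apart from transform_unaggregated_gradients, which is mentioned later in this thread, the hook names and the plain-SGD apply step are placeholders rather than the proposed API:

```python
class SimpleOptimizer:
  """Illustrative-only sketch of a hook-based minimize.

  Wrappers (clipping, mixed precision, Horovod, ...) override only the
  hooks they need; the training loop only ever calls `minimize`.
  """

  def __init__(self, learning_rate=0.01):
    self.learning_rate = learning_rate

  def minimize(self, loss, variables, tape):
    loss = self.transform_loss(loss)                      # e.g. loss scaling
    grads = tape.gradient(loss, variables)
    grads = self.transform_unaggregated_gradients(grads)  # per-replica transforms
    grads = self.aggregate_gradients(grads)               # e.g. cross-replica sum
    grads = self.transform_gradients(grads)               # e.g. gradient clipping
    for grad, var in zip(grads, variables):
      var.assign_sub(self.learning_rate * grad)           # placeholder SGD apply

  # Default hooks are pass-throughs.
  def transform_loss(self, loss):
    return loss

  def transform_unaggregated_gradients(self, grads):
    return grads

  def aggregate_gradients(self, grads):
    return grads

  def transform_gradients(self, grads):
    return grads
```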
Specific use cases considered:
- Gradient clipping
- Mixed precision
- Horovod
In TensorFlow Addons, we have multiple wrappers/hooks that need to work with arbitrary optimizers, so we need to make sure the API covers those use cases:
- @Squadrick could you take a look and ensure that the Stochastic Weight Averaging and the moving average wrapper can be implemented with this new API?
- @cyberzhg could you take a look and ensure that the lookahead mechanism can be implemented with the new API?
- @philjd could you take a look and ensure the decoupled weight decay mechanism can be implemented with this new API?
Thank you in advance. After this RFC is merged and implemented, we should migrate those 4 wrappers to the new API.
Link to the nicely formatted document: https://github.com/omalleyt12/community/blob/patch-1/rfcs/2020-04-20-Optimizer-minimize.md
Should we create a utility class to help with wrapping an Optimizer? I.e. OptimizerWrapper?
If possible, could you all write some pseudo code of what each wrapper implementation would look like with this new API? That would allow us to answer this last question.
Please send out an email to [email protected] with the announcement of this pull request and give at least a two week review period per: https://github.com/tensorflow/community/blob/master/governance/TF-RFCs.md
Thanks for the ping @gabrieldemarmiesse! Unfortunately, I've never used Keras, so I'm missing a lot of context and I might overlook some solution to this. I'd be super happy about comments on the proposals below :)
Our API use case: In general, the idea of decoupled weight decay is to decouple the decaying variable update from the optimizer variable update. While weight decay is equal to an L2 penalty on the weights for some optimizers, this is not true for adaptive optimizers such as Adam. So the more general solution would be something like (pseudo code):

```python
# Pseudo code: the decay update is decoupled from the optimizer's gradient update.
optimizer.apply_gradients(grads_and_vars)
for var in variables:
    var.assign_sub(weight_decay * var)
```
The challenge here is that the set of variables to be decayed might not be the same as the set of trainable variables (e.g., we might want to exclude batch norm parameters). For maximum flexibility, we've been wrapping the optimizer's minimize method and allowing users to pass a list of the variables to decay, as minimize(loss, var_list, ..., decay_var_list=None).
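For concreteness, a hedged sketch of how that existing TFA-style signature is used with tensorflow_addons' AdamW today (the model and the batch-norm name filter are purely illustrative):

```python
import tensorflow as tf
import tensorflow_addons as tfa

model = tf.keras.Sequential([tf.keras.layers.Dense(10),
                             tf.keras.layers.BatchNormalization()])
model.build((None, 4))
optimizer = tfa.optimizers.AdamW(weight_decay=1e-4, learning_rate=1e-3)

# Decay everything except the batch-norm parameters.
decay_vars = [v for v in model.trainable_variables
              if "batch_normalization" not in v.name]

x = tf.random.normal((8, 4))
loss_fn = lambda: tf.reduce_mean(model(x, training=True) ** 2)
optimizer.minimize(loss_fn, model.trainable_variables,
                   decay_var_list=decay_vars)
```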
Possible solutions:
- Make the decay_var_list a member variable of the optimizer. This might be annoying if the variables to minimize differ between train steps.
- Create a second optimizer that only decays variables and chain the optimizers (I don't know whether chaining optimizers in Keras is possible).
- Always decay all variables. To avoid decaying some variables, one would need to create separate optimizers for those. Again, I don't know whether using two or more different optimizers is possible in Keras (especially with model.fit).
Thanks for looping these people in @gabrieldemarmiesse!
@PhilJd Thanks for the comments!
Re weight decay: In general you can still wrap Optimizer.minimize to perform this:
```python
class WeightDecayOptimizer(Optimizer):

  def minimize(self, loss, variables, tape, decay_variables=None):
    if decay_variables is None:
      decay_variables = variables
    # Decoupled decay: shrink each selected variable, then do the regular update.
    for var in decay_variables:
      var.assign_sub(self.weight_decay * var)
    super(WeightDecayOptimizer, self).minimize(loss, variables, tape)
```
The decay_variables argument would not be passed in the default Model.fit, so only the use case where the variables and decay_variables are the same would be supported. This is the case for Model.fit today though, so that wouldn't be a regression. Users could also override Model.train_step (which allows users to provide their own custom training step logic) to support this use case if they want.
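For example, a hedged sketch of what such a Model.train_step override could look like, building on the WeightDecayOptimizer sketch above (the bias-exclusion rule is illustrative):

```python
import tensorflow as tf

class WeightDecayModel(tf.keras.Model):

  def train_step(self, data):
    x, y = data
    with tf.GradientTape() as tape:
      y_pred = self(x, training=True)
      loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)
    # Illustrative rule: decay everything except bias variables.
    decay_vars = [v for v in self.trainable_variables if "bias" not in v.name]
    # `minimize(loss, variables, tape, decay_variables=...)` is the
    # hypothetical signature of the WeightDecayOptimizer sketched above.
    self.optimizer.minimize(loss, self.trainable_variables, tape,
                            decay_variables=decay_vars)
    self.compiled_metrics.update_state(y, y_pred)
    return {m.name: m.result() for m in self.metrics}
```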
> Make the decay_var_list a member variable of the optimizer. This might be annoying if the variables to minimize differ between train steps.
This would work, although it's worth noting as well that oftentimes, Variables are created when the Model is first called on data, so for most workflows the Model Variables aren't guaranteed to exist when the Optimizer is created.
> Create a second optimizer that only decays variables and chain the optimizers (I don't know whether chaining optimizers in Keras is possible).
This wouldn't work with the default Model.fit logic, whereas the decay_variables argument pattern at least works as long as the decay_variables and variables are the same.
In general, there's probably a question of whether Model.fit should have a separate configuration for how to decay variables, since weight decay seems orthogonal to the Optimizer (in that, IIUC, it never needs to know about the gradients).
/CC @DEKHTIARJonathan, @tgaddair. This API will be relevant to Horovod's distributed optimizer.
Thanks @reedwm, much appreciated to be informed :)
I actually truly like the idea of this RFC. I have defended many points of this RFC's design in this PR: https://github.com/tensorflow/tensorflow/issues/36398.
About:
> Should we create a utility class to help with wrapping an Optimizer? I.e. OptimizerWrapper? @omalleyt12
I would be in favor of this. Having OptimizerWrapper would greatly improve the maintainability of AMP and Horovod and allow us to provide "a standard" way to plug into TensorFlow for anyone thinking of doing so. On the core API side, it also simplifies API support: you can state that the public API of the OptimizerWrapper class is the only one supported long term, and interface with it behind the scenes in whatever way is needed, completely abstracted from the Horovod or AMP Optimizers.
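To make the idea concrete, here is a hypothetical sketch of what such an OptimizerWrapper base class could look like (nothing below is an existing API):

```python
import tensorflow as tf

class OptimizerWrapper(tf.keras.optimizers.Optimizer):
  """Hypothetical base class that delegates to an inner optimizer, so that
  wrappers such as AMP or Horovod override only the steps they care about."""

  def __init__(self, inner_optimizer, name="OptimizerWrapper", **kwargs):
    super(OptimizerWrapper, self).__init__(name, **kwargs)
    self.inner_optimizer = inner_optimizer

  def apply_gradients(self, grads_and_vars, **kwargs):
    # A concrete wrapper would transform `grads_and_vars` here (clip, unscale,
    # allreduce, ...) before delegating to the wrapped optimizer.
    return self.inner_optimizer.apply_gradients(grads_and_vars, **kwargs)

  def get_config(self):
    config = super(OptimizerWrapper, self).get_config()
    config["inner_optimizer"] = tf.keras.optimizers.serialize(
        self.inner_optimizer)
    return config
```

A gradient-clipping wrapper, for example, would then override only apply_gradients (or the finer-grained hooks proposed in the RFC) and leave everything else to the inner optimizer.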
Overall, I'm really enthusiastic about this RFC. Sounds like a great improvement :+1:
@hyang0129 could you take a look at this RFC to see if it would make https://github.com/tensorflow/addons/pull/969 easier to implement? Maybe add some comments if you have ideas.
@Squadrick @CyberZHG friendly ping to see if this RFC can make the implementation of Stochastic Weight Averaging, the moving average wrapper, and the lookahead mechanism simpler in TensorFlow Addons.
@omalleyt12 gradient transformations may have unintended consequences when paired with moment based optimizers such as Adam. These optimizers usually take the first and second moment of the gradient to compute adaptive learning rates for each parameter. Scaling a gradient by a factor X will not scale the parameter update by a factor of X, because the moments of the gradient do not scale linearly with the gradient.
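A tiny numeric illustration of this point (not from the RFC), using a single simplified Adam-style step:

```python
import numpy as np

def adam_like_step(g, beta1=0.9, beta2=0.999, eps=1e-8):
  # One bias-corrected Adam-style step: m_hat / (sqrt(v_hat) + eps).
  m_hat = ((1 - beta1) * g) / (1 - beta1)       # first moment, bias-corrected
  v_hat = ((1 - beta2) * g ** 2) / (1 - beta2)  # second moment, bias-corrected
  return m_hat / (np.sqrt(v_hat) + eps)

print(adam_like_step(0.01))    # ~1.0
print(adam_like_step(1000.0))  # still ~1.0: scaling g by 1e5 barely changes the step
```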
@hyang0129 Agreed, the loss scaling optimizer example doesn't actually scale the applied gradients. Instead, it temporarily scales up the loss so that gradients are computed in a numerically stable way, and then unscales them before applying Variable updates.
For other, user-written gradient transformations such as gradient clipping, ensuring that the math is doing what they want is up to them
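For reference, a hedged sketch of that pattern with the existing (TF 2.2-era) experimental LossScaleOptimizer in a custom training step:

```python
import tensorflow as tf

var = tf.Variable(1.0)
opt = tf.keras.mixed_precision.experimental.LossScaleOptimizer(
    tf.keras.optimizers.SGD(0.01), loss_scale="dynamic")

with tf.GradientTape() as tape:
  loss = var ** 2
  scaled_loss = opt.get_scaled_loss(loss)          # scale the loss up
scaled_grads = tape.gradient(scaled_loss, [var])
grads = opt.get_unscaled_gradients(scaled_grads)   # unscale before applying
opt.apply_gradients(zip(grads, [var]))             # update is mathematically unchanged
```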
@gabrieldemarmiesse Sorry about the delay. Moving average_wrapper over to the new API would be fairly easy; we only need to override transform_unaggregated_gradients.
Once average_wrapper is migrated, migrating the moving_average and stochastic_weight_average optimizers would be very simple.
@omalleyt12 Please post the notes from the design review here.
@omalleyt12 A gentle ping so we can finish this up.
@omalleyt12 told me that this design will likely need another round of review at some point as there were differing opinions.
Hello, do you have any information about when optimizers will be allowed to perform global gradient clipping across several GPUs? Some of my work quite critically depends on this. Thank you!
https://github.com/tensorflow/tensorflow/commit/53576063848800b6ee906c6e94d840fbdbbecd1b
@adriendoerig if you use Horovod, that is already quite easy to do :)
Just wondering, why is clipping by global norm also not easily supported under a distribution strategy? It would be really helpful, as it keeps the gradient direction unchanged.
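Until the RFC lands, one hedged sketch of a workaround in a custom training loop under tf.distribute (TF 2.2-era APIs; scaling the loss by the global batch size is omitted for brevity): aggregate the gradients manually, clip by global norm, then tell apply_gradients not to aggregate again.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
  model.build((None, 4))
  optimizer = tf.keras.optimizers.SGD(0.1)

@tf.function
def train_step(dist_inputs):
  def step_fn(x, y):
    with tf.GradientTape() as tape:
      loss = tf.reduce_mean(tf.square(y - model(x, training=True)))
    grads = tape.gradient(loss, model.trainable_variables)
    # Sum the per-replica gradients first so the global norm is computed
    # over the fully aggregated gradients...
    ctx = tf.distribute.get_replica_context()
    grads = ctx.all_reduce(tf.distribute.ReduceOp.SUM, grads)
    grads, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)
    # ...then skip the optimizer's own aggregation.
    optimizer.apply_gradients(zip(grads, model.trainable_variables),
                              experimental_aggregate_gradients=False)
  strategy.run(step_fn, args=dist_inputs)
```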
@omalleyt12, @fchollet Are you still pursuing this or should this be closed out?