
Nadam optimizer differences

albertz opened this issue 2 years ago

Our TF-layers Nadam optimizer is basically the same as Adam, except that we set use_nesterov=True for training_ops.apply_adam. It is based on the TF 1.15 tensorflow/contrib/opt/python/training/nadam_optimizer.py. So it also has the same options as normal Adam:

  • learning_rate=0.001
  • beta1=0.9
  • beta2=0.999
  • epsilon=1e-8

I noticed that tf.keras.optimizers.experimental.Nadam has some different options:

  • epsilon=1e-07
  • weight_decay=None
  • clipnorm=None
  • clipvalue=None
  • global_clipnorm=None
  • use_ema=False
  • ema_momentum=0.99
  • ema_overwrite_frequency=None

Ok, I did not look into this further. The clipping and weight decay options are probably added here so that they are decoupled from the optimizer update itself. use_ema is disabled by default, so the ema_... options are not used. So it is probably mostly the same, except for a different epsilon default.
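For reference, global-norm gradient clipping (what global_clipnorm refers to) can be sketched like this (a minimal NumPy illustration, not the Keras implementation; the function name is my own):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients jointly so their combined L2 norm is at most max_norm."""
    gnorm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (gnorm + 1e-12))
    return [g * scale for g in grads]
```

Unlike per-tensor clipnorm, this preserves the relative direction across all gradient tensors.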

See also: https://github.com/rwth-i6/returnn/issues/766#issuecomment-979216833 https://github.com/keras-team/keras/issues/15710

Now I noticed, in PyTorch, torch.optim.NAdam again has different options:

  • lr=0.002
  • eps=1e-08
  • weight_decay=0
  • momentum_decay=0.004
  • decoupled_weight_decay=False

I specifically wonder about the momentum_decay. What is this?
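From what I can tell from the PyTorch docs (I have not verified this against the source), momentum_decay is the ψ parameter of the momentum schedule from Dozat's Nadam paper: μ_t = β1 · (1 − 0.5 · 0.96^(t·ψ)), i.e. the effective momentum is warmed up from roughly 0.45·β1 toward β1 over training. A small sketch (function name is my own):

```python
def nadam_mu(t, beta1=0.9, momentum_decay=0.004):
    # Momentum schedule from the Nadam paper (Dozat, 2016), as I understand
    # torch.optim.NAdam uses it: mu_t ramps from ~0.5 * beta1 toward beta1.
    return beta1 * (1.0 - 0.5 * 0.96 ** (t * momentum_decay))
```

So a larger momentum_decay makes the schedule reach β1 sooner; with momentum_decay=0 the momentum would stay at 0.5·β1 forever.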

albertz, Oct 18 '23 15:10