Nadam optimizer differences
Our TF-layers Nadam optimizer is basically the same as Adam, except that we pass use_nesterov=True to training_ops.apply_adam. It is based on the TF 1.15 tensorflow/contrib/opt/python/training/nadam_optimizer.py. So it has the same options as normal Adam:
- learning_rate=0.001
- beta1=0.9
- beta2=0.999
- epsilon=1e-8
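To make the difference concrete, here is a minimal plain-Python sketch of a single update step, following what TF's apply_adam computes when use_nesterov=True (this is a sketch of the update formula, not the actual fused kernel):

```python
import math

def nadam_step(var, g, m, v, t,
               lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Nadam step, as in TF's apply_adam with use_nesterov=True.
    All arguments are plain floats; t is the 1-based step count.
    Returns the updated (var, m, v)."""
    # Bias-corrected step size, as in training_ops.apply_adam.
    lr_t = lr * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
    # Standard Adam moment updates.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    # Nesterov variant: use the lookahead momentum estimate
    # (beta1 * m_t + (1 - beta1) * g) instead of m_t alone.
    var = var - lr_t * (beta1 * m + (1.0 - beta1) * g) / (math.sqrt(v) + epsilon)
    return var, m, v
```

With use_nesterov=False, the numerator of the last line would just be m, which recovers plain Adam.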
I noticed that tf.keras.optimizers.experimental.Nadam has some different options:
- epsilon=1e-07
- weight_decay=None
- clipnorm=None
- clipvalue=None
- global_clipnorm=None
- use_ema=False
- ema_momentum=0.99
- ema_overwrite_frequency=None
Ok, I did not look into this further. The clipping and weight decay options are probably added in the common Keras optimizer base class, in a decoupled way. use_ema is disabled by default, so the ema_... options are unused. So it is probably mostly the same, except for the different epsilon default.
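For the clipping options, the base optimizer presumably rescales the gradients before the Nadam update itself. A minimal sketch of what global_clipnorm-style clipping does (assuming it mirrors tf.clip_by_global_norm semantics, with gradients as plain floats for illustration):

```python
import math

def clip_by_global_norm(grads, global_clipnorm):
    """Rescale all gradients jointly so their global L2 norm does not
    exceed global_clipnorm. Sketch only; the real Keras implementation
    operates on tensors per variable."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > global_clipnorm:
        scale = global_clipnorm / norm
        grads = [g * scale for g in grads]
    return grads
```

Since this happens before the optimizer-specific update, it is orthogonal to (decoupled from) the Nadam math itself.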
See also: https://github.com/rwth-i6/returnn/issues/766#issuecomment-979216833 https://github.com/keras-team/keras/issues/15710
Now I noticed that in PyTorch, torch.optim.NAdam again has different options:
- lr=0.002
- eps=1e-08
- weight_decay=0
- momentum_decay=0.004
- decoupled_weight_decay=False
I specifically wonder about the momentum_decay. What is this?