Nadam optimizer differences
Our TF-layers Nadam optimizer is basically the same as Adam, except that we pass use_nesterov=True to training_ops.apply_adam. It is based on the TF 1.15 tensorflow/contrib/opt/python/training/nadam_optimizer.py. So it has the same options as normal Adam:
- learning_rate=0.001
- beta1=0.9
- beta2=0.999
- epsilon=1e-8
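To make the difference concrete, here is a minimal plain-Python sketch of a single update step, following what TF's apply_adam computes when use_nesterov=True (this is a sketch of the update formula, not the actual fused kernel):

```python
import math

def nadam_step(var, g, m, v, t,
               lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Nadam step, as in TF's apply_adam with use_nesterov=True.
    All arguments are plain floats; t is the 1-based step count.
    Returns the updated (var, m, v)."""
    # Bias-corrected step size, as in training_ops.apply_adam.
    lr_t = lr * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
    # Standard Adam moment updates.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    # Nesterov variant: use the lookahead momentum estimate
    # (beta1 * m_t + (1 - beta1) * g) instead of m_t alone.
    var = var - lr_t * (beta1 * m + (1.0 - beta1) * g) / (math.sqrt(v) + epsilon)
    return var, m, v
```

With use_nesterov=False, the numerator of the last line would just be m, which recovers plain Adam.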
I noticed that tf.keras.optimizers.experimental.Nadam has some different options:
- epsilon=1e-07
- weight_decay=None
- clipnorm=None
- clipvalue=None
- global_clipnorm=None
- use_ema=False
- ema_momentum=0.99
- ema_overwrite_frequency=None
Ok, I did not look into this further. The clipping and weight decay options are probably added in the common Keras optimizer base class, in a decoupled way. use_ema is disabled by default, so the ema_... options are unused. So it is probably mostly the same, except for the different epsilon default.
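For the clipping options, the base optimizer presumably rescales the gradients before the Nadam update itself. A minimal sketch of what global_clipnorm-style clipping does (assuming it mirrors tf.clip_by_global_norm semantics, with gradients as plain floats for illustration):

```python
import math

def clip_by_global_norm(grads, global_clipnorm):
    """Rescale all gradients jointly so their global L2 norm does not
    exceed global_clipnorm. Sketch only; the real Keras implementation
    operates on tensors per variable."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > global_clipnorm:
        scale = global_clipnorm / norm
        grads = [g * scale for g in grads]
    return grads
```

Since this happens before the optimizer-specific update, it is orthogonal to (decoupled from) the Nadam math itself.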
See also: https://github.com/rwth-i6/returnn/issues/766#issuecomment-979216833 https://github.com/keras-team/keras/issues/15710
Now I noticed that in PyTorch, torch.optim.NAdam again has different options:
- lr=0.002
- eps=1e-08
- weight_decay=0
- momentum_decay=0.004
- decoupled_weight_decay=False
I specifically wonder about the momentum_decay. What is this?