Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers

Robin M. Schmidt, Frank Schneider, and Philipp Hennig

Paper: [ICML 2021]

Abstract: Choosing the optimizer is considered to be among the most crucial design decisions in deep learning, and it is not an easy one. The growing literature now lists hundreds of optimization methods. In the absence of clear theoretical guidance and conclusive empirical evidence, the decision is often made based on anecdotes. In this work, we aim to replace these anecdotes, if not with a conclusive ranking, then at least with evidence-backed heuristics. To do so, we perform an extensive, standardized benchmark of fifteen particularly popular deep learning optimizers while giving a concise overview of the wide range of possible choices. Analyzing more than 50,000 individual runs, we contribute the following three points: (i) Optimizer performance varies greatly across tasks. (ii) We observe that evaluating multiple optimizers with default parameters works approximately as well as tuning the hyperparameters of a single, fixed optimizer. (iii) While we cannot discern an optimization method clearly dominating across all tested tasks, we identify a significantly reduced subset of specific optimizers and parameter choices that generally lead to competitive results in our experiments: Adam remains a strong contender, with newer methods failing to significantly and consistently outperform it. Our open-sourced results are available as challenging and well-tuned baselines for more meaningful evaluations of novel optimization methods without requiring any further computational efforts.

Results

This repository provides the full log files of all our benchmarking results. They are organized into:

Main results: Comparison of fifteen popular deep learning optimizers on eight problems, using four different tuning budgets and four different learning rate schedules. The main results amount to more than 50,000 individual runs.
Tuning validation: In order to test the stability of our benchmark, we re-tuned two optimizers (RMSProp and AdaDelta) on all problems a second time. The results of this evaluation are shown in Appendix D in our paper.
Seed robustness: An optimizer's performance can be sensitive to the random seed. In this analysis, we performed an extensive grid search on the learning rate for SGD, using ten different seeds throughout. This identifies a “danger zone” of learning rates in which performance can be sensitive to the random seed. The full analysis can be found in Appendix C of our paper.
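
The idea behind the seed robustness analysis can be illustrated with a small sketch. It is not the procedure used in the paper: the function name `flag_seed_sensitive`, the relative-spread criterion, its threshold, and the toy numbers (four seeds instead of ten) are all assumptions made for illustration. The sketch flags learning rates whose final loss varies strongly from seed to seed.

```python
from statistics import mean, stdev


def flag_seed_sensitive(results, rel_spread_threshold=0.5):
    """Return the learning rates whose final loss varies strongly across seeds.

    results: dict mapping a learning rate to a list of final losses,
             one entry per random seed (hypothetical format).
    rel_spread_threshold: hypothetical cutoff on std / mean across seeds.
    """
    flagged = []
    for lr in sorted(results):
        losses = results[lr]
        mu = mean(losses)
        sigma = stdev(losses)  # sample standard deviation across seeds
        if mu > 0 and sigma / mu > rel_spread_threshold:
            flagged.append(lr)
    return flagged


# Made-up final losses for illustration only (not taken from the log files).
toy = {
    0.001: [0.52, 0.50, 0.51, 0.53],
    0.01: [0.31, 0.30, 0.32, 0.30],
    0.1: [0.25, 2.30, 0.27, 2.30],  # some seeds train well, others diverge
    1.0: [2.30, 2.31, 2.30, 2.30],  # fails consistently, so not seed-sensitive
}

print(flag_seed_sensitive(toy))
```

In this toy data only the learning rate 0.1 is flagged: smaller values train reliably, the largest value fails for every seed, and the one in between depends on the seed.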

Everyone is invited to use those results, for example, to evaluate the performance of newly developed optimizers or as training data for meta-learned optimizers.

We are happy to extend our results with additional optimizers, provided they are generated using the same process described in the paper, to ensure a fair comparison.

Below we highlight two figures summarizing our evaluation of those results. For the full evaluation with all our plots and explanations, please check out our paper.

Parallel Coordinates Plot

Heatmap

List of Optimizers

Our paper lists over a hundred optimizers that have been proposed for deep learning applications. Here, we keep a digital version that is updated regularly. If you want to add a method or find an out-of-date reference, please feel free to open a pull request.

AcceleGrad (Levy et al., 2018) [NeurIPS]
ACClip (Zhang et al., 2020) [NeurIPS]
AdaAlter (Xie et al., 2019) [arXiv]
AdaBatch (Devarakonda et al., 2017) [arXiv]
AdaBayes/AdaBayes-SS (Aitchison, 2020) [NeurIPS]
AdaBelief (Zhuang et al., 2020) [NeurIPS]
AdaBlock (Yun et al., 2019) [arXiv]
AdaBound/AMSBound (Luo et al., 2019) [ICLR]
AdaComp (Chen et al., 2018) [AAAI]
Adadelta (Zeiler, 2012) [arXiv]
Adafactor (Shazeer & Stern, 2018) [ICML]
AdaFix (Bae et al., 2019) [arXiv]
AdaFom (Chen et al., 2019) [ICLR]
AdaFTRL (Orabona & Pál, 2015) [ALT]
Adagrad (Duchi et al., 2011) [JMLR]
ADAHESSIAN (Yao et al., 2020) [arXiv]
Adai (Xie et al., 2020) [arXiv]
AdaLoss (Teixeira et al., 2019) [arXiv]
Adam (Kingma & Ba, 2015) [ICLR]
Adam+ (Liu et al., 2020) [arXiv]
AdamAL (Tao et al., 2019) [OpenReview]
AdaMax (Kingma & Ba, 2015) [ICLR]
AdamBS (Liu et al., 2020) [arXiv]
AdamNC (Reddi et al., 2018) [ICLR]
AdaMod (Ding et al., 2019) [arXiv]
AdamP (Heo et al., 2020) [arXiv]
AdamT (Zhou et al., 2020) [arXiv]
AdamW (Loshchilov & Hutter, 2019) [ICLR]
AdamX (Tran & Phong, 2019) [IEEE Access]
ADAS (Eliyahu, 2020) [Github]
AdaS (Hosseini & Plataniotis, 2020) [arXiv]
AdaScale (Johnson et al., 2020) [ICML]
AdaSGD (Wang & Wiens, 2020) [arXiv]
AdaShift (Zhou et al., 2019) [ICLR]
AdaSqrt (Hu et al., 2019) [arXiv]
Adathm (Sun et al., 2019) [IEEE DSC]
AdaX/AdaX-W (Li et al., 2020) [arXiv]
AEGD (Liu & Tian, 2020) [arXiv]
ALI-G (Berrada et al., 2020) [ICML]
AMSGrad (Reddi et al., 2018) [ICLR]
AngularGrad (Roy et al., 2021) [arXiv]
ArmijoLS (Vaswani et al., 2019) [NeurIPS]
ARSG (Chen et al., 2019) [arXiv]
ASAM (Kwon et al., 2021) [arXiv]
AutoLRS (Jin et al., 2021) [ICLR]
AvaGrad (Savarese et al., 2019) [arXiv]
BAdam (Salas et al., 2018) [arXiv]
BGAdam (Bai & Zhang, 2019) [arXiv]
BPGrad (Zhang et al., 2017) [arXiv]
BRMSProp (Aitchison, 2020) [NeurIPS]
BSGD (Hu et al., 2020) [NeurIPS]
C-ADAM (Tutunov et al., 2020) [arXiv]
CADA (Chen et al., 2020) [arXiv]
COCOB (Orabona & Tommasi, 2017) [NeurIPS]
Cool Momentum (Borysenko & Byshkin, 2020) [arXiv]
CProp (Preechakul & Kijsirikul, 2019) [arXiv]
Curveball (Henriques et al., 2019) [ICCV]
Dadam (Nazari et al., 2019) [arXiv]
DeepMemory (Wright, 2020) [Github]
DGNOpt (Liu et al., 2021) [ICML]
DiffGrad (Dubey et al., 2020) [IEEE Transactions on Neural Networks and Learning Systems]
EAdam (Yuan & Gao, 2020) [arXiv]
EKFAC (George et al., 2018) [NeurIPS]
Eve (Hayashi, 2016) [arXiv]
Expectigrad (Daley & Amato, 2020) [arXiv]
FastAdaBelief (Zhou et al., 2021) [arXiv]
FRSGD (Wang & Ye, 2020) [arXiv]
G-AdaGrad (Chakrabarti & Chopra, 2021) [arXiv]
GADAM (Zhang & Gouza, 2018) [arXiv]
Gadam (Granziol et al., 2020) [arXiv]
GOALS (Chae et al., 2021) [arXiv]
GOLS-I (Kafka & Wilke, 2019) [arXiv]
Grad-Avg (Purkayastha & Purkayastha, 2020) [arXiv]
GRAPES (Dellaferrera et al., 2021) [arXiv]
Gravilon (Kelterborn et al., 2020) [arXiv]
Gravity (Bahrami & Zadeh, 2021) [arXiv]
HAdam (Jiang et al., 2019) [arXiv]
HyperAdam (Wang et al., 2019) [AAAI]
K-BFGS/K-BFGS(L) (Goldfarb et al., 2020) [NeurIPS]
K-FAC (Martens & Grosse, 2015) [ICML]
KF-QN-CNN (Ren & Goldfarb, 2021) [arXiv]
KFLR/KFRA (Botev et al., 2017) [ICML]
L4Adam/L4Momentum (Rolínek & Martius, 2018) [NeurIPS]
LAMB (You et al., 2020) [ICLR]
LaProp (Ziyin et al., 2020) [arXiv]
LARS (You et al., 2017) [ICLR]
LHOPT (Almeida et al., 2021) [arXiv]
LookAhead (Zhang et al., 2019) [NeurIPS]
M-SVAG (Balles & Hennig, 2018) [ICML]
MADGRAD (Defazio & Jelassi, 2021) [arXiv]
MAS (Landro et al., 2020) [arXiv]
MEKA (Chen et al., 2020) [arXiv]
MTAdam (Malkiel & Wolf, 2020) [arXiv]
MVRC-1/MVRC-2 (Chen & Zhou, 2020) [arXiv]
Nadam (Dozat, 2016) [ICLR]
NAMSB/NAMSG (Chen et al., 2019) [arXiv]
ND-Adam (Zhang et al., 2017) [arXiv]
Nero (Liu et al., 2021) [arXiv]
Nesterov Accelerated Momentum (Nesterov, 1983) [Soviet Mathematics Doklady]
Noisy Adam/Noisy K-FAC (Zhang et al., 2018) [ICML]
NosAdam (Huang et al., 2019) [IJCAI]
Novograd (Ginsburg et al., 2019) [arXiv]
NT-SGD (Zhou et al., 2021) [arXiv]
Padam (Chen et al., 2020) [IJCAI]
PAGE (Li et al., 2020) [arXiv]
PAL (Mutschler & Zell, 2020) [NeurIPS]
PolyAdam (Orvieto et al., 2019) [UAI]
Polyak Momentum (Polyak, 1964) [USSR Computational Mathematics and Mathematical Physics]
PowerSGD/PowerSGDM (Vogels et al., 2019) [NeurIPS]
Probabilistic Polyak (de Roos et al., 2021) [arXiv]
ProbLS (Mahsereci & Hennig, 2017) [JMLR]
PSTorm (Xu, 2020) [arXiv]
QHAdam/QHM (Ma & Yarats, 2019) [ICLR]
RAdam (Liu et al., 2020) [ICLR]
Ranger (Wright, 2020) [Github]
RangerLars (Grankin et al., 2020) [Github]
RecursiveOptimizer (Cutkosky & Sarlos, 2019) [ICML]
RescaledExp (Cutkosky & Boahen, 2019) [NeurIPS]
RMSProp (Tieleman & Hinton, 2012) [Coursera]
RMSterov (Choi et al., 2019) [arXiv]
S-SGD (Sung et al., 2020) [arXiv]
SAdam (Wang et al., 2020) [ICLR]
Sadam/SAMSGrad (Tong et al., 2019) [arXiv]
SALR (Yue et al., 2020) [arXiv]
SAM (Foret et al., 2021) [ICLR]
SC-Adagrad/SC-RMSProp (Mukkamala & Hein, 2017) [ICML]
SDProp (Ida et al., 2017) [IJCAI]
SGD (Robbins & Monro, 1951) [Annals of Mathematical Statistics]
SGD-BB (Tan et al., 2016) [NeurIPS]
SGD-G2 (Ayadi & Turinici, 2020) [arXiv]
SGDEM (Ramezani-Kebrya et al., 2021) [arXiv]
SGDHess (Tran & Cutkosky, 2021) [arXiv]
SGDM (Liu & Luo, 2020) [arXiv]
SGDP (Heo et al., 2020) [arXiv]
SGDR (Loshchilov & Hutter, 2017) [ICLR]
SHAdagrad (Huang et al., 2020) [arXiv]
Shampoo (Gupta et al., 2018) [ICML]
SignAdam++ (Wang et al., 2019) [ICDMW]
SignSGD (Bernstein et al., 2018) [ICML]
SKQN/S4QN (Yang et al., 2020) [arXiv]
SM3 (Anil et al., 2019) [NeurIPS]
SMG (Tran et al., 2020) [arXiv]
SNGM (Zhao et al., 2020) [arXiv]
SoftAdam (Fetterman et al., 2019) [OpenReview]
SRSGD (Wang et al., 2020) [arXiv]
Step-Tuned SGD (Castera et al., 2021) [arXiv]
STORM (Cutkosky & Orabona, 2019) [NeurIPS]
SWATS (Keskar & Socher, 2017) [arXiv]
SWNTS (Chen et al., 2019) [arXiv]
TAdam (Ilboudo et al., 2020) [arXiv]
TEKFAC (Gao et al., 2020) [arXiv]
VAdam (Khan et al., 2018) [ICML]
VR-SGD (Shang et al., 2020) [IEEE Trans. Knowl. Data Eng.]
vSGD-b/vSGD-g/vSGD-l (Schaul et al., 2013) [ICML]
vSGD-fd (Schaul & LeCun, 2013) [ICLR]
WNGrad (Wu et al., 2018) [arXiv]
YellowFin (Zhang & Mitliagkas, 2019) [MLSys]
Yogi (Zaheer et al., 2018) [NeurIPS]

Citation

If you use our results, please consider citing:

Robin M. Schmidt, Frank Schneider, Philipp Hennig
Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers
ICML 2021

@InProceedings{pmlr-v139-schmidt21a,
  title = 	 {Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers},
  author =       {Schmidt, Robin M and Schneider, Frank and Hennig, Philipp},
  booktitle = 	 {Proceedings of the 38th International Conference on Machine Learning},
  pages = 	 {9367--9376},
  year = 	 {2021},
  editor = 	 {Meila, Marina and Zhang, Tong},
  volume = 	 {139},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {18--24 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v139/schmidt21a/schmidt21a.pdf},
  url = 	 {http://proceedings.mlr.press/v139/schmidt21a.html}
}

Contact

Should you have any questions, please feel free to contact the authors.
