Flax Optimizers
A collection of optimizers for Flax. The repository is open to pull requests.
Installation
You can install this library with:
pip install git+https://github.com/nestordemeure/flaxOptimizers.git
Optimizers
Classical optimizers, inherited from the official Flax implementation (see the usage sketch after this list):
- Adafactor A memory-efficient optimizer that has been used for large-scale training of attention-based models.
- Adagrad Introduces a denominator to SGD so that each parameter has its own learning rate.
- Adam The most common stochastic optimizer nowadays.
- LAMB An improvement on LARS that makes it efficient across task types.
- LARS An optimizer designed for large batch sizes.
- Momentum SGD with momentum, optionally Nesterov momentum.
- RMSProp Developed to solve Adagrad's diminishing learning rate problem.
- SGD The simplest stochastic gradient descent optimizer possible.
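As an illustration, here is a minimal training-step sketch that assumes these optimizers follow the legacy flax.optim OptimizerDef interface (an optimizer definition is created, bound to the parameters, then updated with apply_gradient). The toy parameters, loss function and data are invented for the example; check the library's own examples for the exact API.

```python
import jax
import jax.numpy as jnp
import flaxOptimizers

# Toy parameters and loss, invented purely for illustration.
params = {"w": jnp.ones((3,)), "b": jnp.zeros(())}

def loss_fn(params, x, y):
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

# Assumption: the library mirrors the flax.optim interface, so an optimizer
# definition is instantiated and then bound to the parameters.
optimizer_def = flaxOptimizers.Adam(learning_rate=1e-3)
optimizer = optimizer_def.create(params)

x, y = jnp.ones((4, 3)), jnp.zeros((4,))
grads = jax.grad(loss_fn)(optimizer.target, x, y)
# apply_gradient returns a new optimizer holding the updated parameters.
optimizer = optimizer.apply_gradient(grads)
```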
More arcane first-order optimizers (a swap-in sketch follows the list):
- AdamHD Uses hypergradient descent to tune its own learning rate. Good at the beginning of training but tends to underperform at the end.
- AdamP Corrects premature step-size decay for scale-invariant weights. Useful when a model uses some form of batch normalization.
- LapProp Applies exponential smoothing to the update rather than to the gradient.
- MADGRAD Modernisation of the Adagrad family of optimizers, very competitive with Adam.
- RAdam Uses a rectified variance estimation to compute the learning rate. Makes training smoother, especially in the first iterations.
- RAdamSimplified A warmup strategy proposed to reproduce RAdam's results with much lower code complexity.
- Ranger Combines lookahead, RAdam and gradient centralization to try to maximize performance. Designed with image classification problems in mind.
- Ranger21 An upgrade of Ranger that combines adaptive gradient clipping, gradient centralization, positive-negative momentum, norm loss, stable weight decay, linear learning rate warm-up, explore-exploit scheduling, lookahead and Adam. It has been designed with transformers in mind.
- Sadam Introduces an alternative to the epsilon parameter.
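Assuming the same optimizer-definition interface as above, switching to one of these optimizers should only require changing the definition line. The constructor arguments shown are assumptions and may differ from the actual signatures:

```python
import flaxOptimizers

# Hypothetical drop-in replacements; hyperparameter names are assumed,
# check each optimizer's signature before use.
optimizer_def = flaxOptimizers.RAdam(learning_rate=1e-3)
# optimizer_def = flaxOptimizers.Ranger(learning_rate=1e-3)
# optimizer_def = flaxOptimizers.MADGRAD(learning_rate=1e-2)
```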
Optimizer wrappers:
- WeightNorm An alternative to batch normalization that performs the weight normalization inside the optimizer, which makes it compatible with more models and faster (official Flax implementation), as sketched below.
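A possible wrapping pattern, assuming WeightNorm follows the flax.optim convention of taking the wrapped optimizer definition as its first argument (the exact signature is an assumption):

```python
import flaxOptimizers

# Assumed wrapper usage: weight normalization is applied by the optimizer
# around an inner optimizer definition (constructor signature assumed).
inner_def = flaxOptimizers.Adam(learning_rate=1e-3)
optimizer_def = flaxOptimizers.WeightNorm(inner_def)
# The wrapped definition is then used like any other, e.g. optimizer_def.create(params).
```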
Other references
- AdahessianJax contains my implementation of the Adahessian second order optimizer in Flax.
- Flax.optim contains a number of optimizers that do not currently appear in the official documentation. They are all accessible from this library.