Flax Optimizers
A collection of optimizers for Flax. The repository is open to pull requests.
Installation
You can install this library with:
pip install git+https://github.com/nestordemeure/flaxOptimizers.git
Optimizers
Classical optimizers, inherited from the official Flax implementation (see the usage sketch after this list):
- Adafactor A memory-efficient optimizer that has been used for large-scale training of attention-based models.
- Adagrad Introduces a denominator to SGD so that each parameter has its own learning rate.
- Adam The most common stochastic optimizer nowadays.
- LAMB An improvement on LARS that makes it efficient across task types.
- LARS An optimizer designed for large batch sizes.
- Momentum SGD with momentum, optionally Nesterov momentum.
- RMSProp Developed to solve Adagrad's diminishing learning rate problem.
- SGD The simplest stochastic gradient descent optimizer possible.
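As an illustration, here is a minimal training-step sketch that assumes these optimizers follow the legacy flax.optim OptimizerDef interface (an optimizer definition is created, bound to the parameters, then updated with apply_gradient). The toy parameters, loss function and data are invented for the example; check the library's own examples for the exact API.

```python
import jax
import jax.numpy as jnp
import flaxOptimizers

# Toy parameters and loss, invented purely for illustration.
params = {"w": jnp.ones((3,)), "b": jnp.zeros(())}

def loss_fn(params, x, y):
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

# Assumption: the library mirrors the flax.optim interface, so an optimizer
# definition is instantiated and then bound to the parameters.
optimizer_def = flaxOptimizers.Adam(learning_rate=1e-3)
optimizer = optimizer_def.create(params)

x, y = jnp.ones((4, 3)), jnp.zeros((4,))
grads = jax.grad(loss_fn)(optimizer.target, x, y)
# apply_gradient returns a new optimizer holding the updated parameters.
optimizer = optimizer.apply_gradient(grads)
```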
More arcane first-order optimizers (a swap-in sketch follows the list):
- AdamHD Uses hypergradient descent to tune its own learning rate. Good at the beginning of training but tends to underperform at the end.
- AdamP Corrects premature step-size decay for scale-invariant weights. Useful when a model uses some form of batch normalization.
- LapProp Applies exponential smoothing to the update rather than to the gradient.
- MADGRAD Modernisation of the Adagrad family of optimizers, very competitive with Adam.
- RAdam Uses a rectified variance estimation to compute the learning rate. Makes training smoother, especially in the first iterations.
- RAdamSimplified A warmup strategy proposed to reproduce RAdam's results with much lower code complexity.
- Ranger Combines lookahead, RAdam and gradient centralization to try to maximize performance. Designed with image classification problems in mind.
- Ranger21 An upgrade of Ranger that combines adaptive gradient clipping, gradient centralization, positive-negative momentum, norm loss, stable weight decay, linear learning rate warm-up, explore-exploit scheduling, lookahead and Adam. It has been designed with transformers in mind.
- Sadam Introduces an alternative to the epsilon parameter.
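Assuming the same optimizer-definition interface as above, switching to one of these optimizers should only require changing the definition line. The constructor arguments shown are assumptions and may differ from the actual signatures:

```python
import flaxOptimizers

# Hypothetical drop-in replacements; hyperparameter names are assumed,
# check each optimizer's signature before use.
optimizer_def = flaxOptimizers.RAdam(learning_rate=1e-3)
# optimizer_def = flaxOptimizers.Ranger(learning_rate=1e-3)
# optimizer_def = flaxOptimizers.MADGRAD(learning_rate=1e-2)
```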
Optimizer wrappers:
- WeightNorm An alternative to batch normalization that performs the weight normalization inside the optimizer, which makes it compatible with more models and faster (official Flax implementation), as sketched below.
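A possible wrapping pattern, assuming WeightNorm follows the flax.optim convention of taking the wrapped optimizer definition as its first argument (the exact signature is an assumption):

```python
import flaxOptimizers

# Assumed wrapper usage: weight normalization is applied by the optimizer
# around an inner optimizer definition (constructor signature assumed).
inner_def = flaxOptimizers.Adam(learning_rate=1e-3)
optimizer_def = flaxOptimizers.WeightNorm(inner_def)
# The wrapped definition is then used like any other, e.g. optimizer_def.create(params).
```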
Other references
- AdahessianJax contains my implementation of the Adahessian second order optimizer in Flax.
- Flax.optim contains a number of optimizers that do not currently appear in the official documentation. They are all accessible from this library.