
Feature Request: Add AdaBelief Optimizer

Open · jganbar opened this issue 4 months ago · 1 comment

Summary

I would like to propose adding the AdaBelief optimizer to MLX's optimizer suite. AdaBelief is a modern adaptive optimizer that has shown superior performance compared to Adam across various machine learning tasks.

Why AdaBelief?

Proven Performance

  • Consistently outperforms Adam on image classification, language modeling, and machine translation tasks
  • Achieves faster convergence and better generalization in most scenarios
  • More stable training dynamics, especially in later stages of training

Research Impact & Adoption

  • Published at NeurIPS 2020 by researchers from Yale University
  • 400+ citations since publication, showing strong research community adoption
  • Already implemented in major frameworks (PyTorch, TensorFlow, JAX)
  • Widely used by practitioners and researchers

Technical Advantages

  • Same computational cost as Adam (no additional overhead)
  • Same memory requirements as Adam
  • Drop-in replacement for Adam in existing workflows (a minimal usage sketch follows this list)
  • Numerically stable with proper epsilon handling
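
To make the drop-in point concrete, here is a minimal sketch of a typical MLX training step; the `optim.AdaBelief` name is hypothetical and only illustrates that nothing else in the loop would change:

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

model = nn.Linear(16, 1)

# Existing workflow:
optimizer = optim.Adam(learning_rate=1e-3)
# Proposed (hypothetical name, not yet in MLX):
# optimizer = optim.AdaBelief(learning_rate=1e-3)

def loss_fn(model, x, y):
    return nn.losses.mse_loss(model(x), y)

loss_and_grad_fn = nn.value_and_grad(model, loss_fn)

x = mx.random.normal((8, 16))
y = mx.random.normal((8, 1))
loss, grads = loss_and_grad_fn(model, x, y)

# The rest of the step is identical regardless of which optimizer is used.
optimizer.update(model, grads)
mx.eval(model.parameters(), optimizer.state)
```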

MLX-Specific Benefits

  • Perfect fit for Apple Silicon optimization due to similar computational pattern to Adam
  • Can leverage MLX's unified memory model efficiently
  • Enhances MLX's appeal to the research community
  • Aligns with MLX's goal of providing modern ML tools

References

  1. Original Paper: https://arxiv.org/abs/2010.07468
  2. Official Implementation: https://github.com/juntang-zhuang/Adabelief-Optimizer
  3. PyTorch Documentation: Available in torch.optim

I'm excited to contribute to MLX and help expand its optimizer ecosystem. Looking forward to your feedback and guidance on how to proceed.

jganbar · Aug 09 '25 07:08

Hi everyone,

I’d love to work on adding the AdaBelief optimizer to mlx.optimizers as proposed here.

Planned implementation

  • API similar to Adam / AdamW (learning_rate, betas, eps, weight_decay, optional bias_correction)
  • Core update rule from the NeurIPS 2020 paper (EMA of (g_t - m_t)^2 for the variance term); a rough implementation sketch follows this list
  • Same computational/memory cost as Adam
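
A rough sketch of the core update, assuming the `init_single` / `apply_single` hooks and the `_maybe_schedule` helper that MLX's built-in optimizers use (please correct me if the preferred extension points differ). Bias correction and weight decay are left out pending the questions below:

```python
import mlx.core as mx
from mlx.optimizers import Optimizer


class AdaBelief(Optimizer):
    """Sketch only: hyperparameter names mirror mlx.optimizers.Adam;
    defaults are placeholders until the questions below are settled."""

    def __init__(self, learning_rate, betas=(0.9, 0.999), eps=1e-8):
        super().__init__()
        # Mirrors how the existing optimizers register a (possibly scheduled) lr.
        self._maybe_schedule("learning_rate", learning_rate)
        self.betas = betas
        self.eps = eps

    def init_single(self, parameter, state):
        # Same two EMA buffers as Adam: first moment m and the "belief" term s.
        state["m"] = mx.zeros_like(parameter)
        state["s"] = mx.zeros_like(parameter)

    def apply_single(self, gradient, parameter, state):
        lr = self.learning_rate.astype(gradient.dtype)
        b1, b2 = self.betas

        m = b1 * state["m"] + (1 - b1) * gradient
        # Key difference from Adam: EMA of the squared *deviation* of the
        # gradient from its EMA, not of the squared gradient itself.
        s = b2 * state["s"] + (1 - b2) * mx.square(gradient - m) + self.eps

        state["m"] = m
        state["s"] = s
        return parameter - lr * m / (mx.sqrt(s) + self.eps)
```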

Questions before starting

  1. Should we keep eps=1e-8 (matching Adam) or use the paper’s 1e-16 for numerical stability?
  2. The paper enables bias correction (bias_correction=True), while MLX's Adam currently defaults to False. Which behavior should I follow for consistency? (The bias-corrected update is written out after this list for reference.)
  3. Should decoupled weight decay (AdamW style) be part of the first version?
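
For context on questions 1 and 2, this is the update rule from the paper with bias correction; note that eps appears both inside the s_t update and in the denominator:

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad
s_t = \beta_2 s_{t-1} + (1 - \beta_2)(g_t - m_t)^2 + \epsilon
$$

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^{\,t}}, \qquad
\hat{s}_t = \frac{s_t}{1 - \beta_2^{\,t}}, \qquad
\theta_t = \theta_{t-1} - \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{s}_t} + \epsilon}
$$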

Planned additions

  • Unit tests under python/tests/test_optimizers.py (rough shape sketched after this list)
  • Docstring + small usage example in Optimizers docs
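
For the tests, something roughly along these lines, adapted to whatever conventions test_optimizers.py already follows (the `AdaBelief` name is of course the proposed addition):

```python
import unittest

import mlx.core as mx
import mlx.optimizers as opt


class TestAdaBelief(unittest.TestCase):
    def test_adabelief_updates_parameters(self):
        params = {"w": mx.ones((4,))}
        grads = {"w": mx.full((4,), 0.5)}

        optimizer = opt.AdaBelief(learning_rate=1e-2)  # proposed optimizer
        updated = optimizer.apply_gradients(grads, params)

        # Shapes are preserved and the parameters actually moved.
        self.assertEqual(updated["w"].shape, params["w"].shape)
        self.assertFalse(mx.allclose(updated["w"], params["w"]).item())


if __name__ == "__main__":
    unittest.main()
```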

If this scope looks good, I can start on the implementation right away.

Thanks for reviewing, and looking forward to your feedback!

harsh-sutariya · Oct 19 '25 18:10