
schedulefree optimizers

Open winglian opened this pull request 4 months ago • 12 comments

What does this PR do?

Integrates Meta's schedule-free optimizers (https://github.com/facebookresearch/schedule_free) for AdamW & SGD.

https://twitter.com/aaron_defazio/status/1776320004465582331
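For reference, a minimal sketch of how the upstream schedulefree package is driven (toy model and loop, assuming `pip install schedulefree`); the key difference from plain AdamW is that the optimizer replaces the LR schedule and must be switched between train and eval modes:

```python
import torch
import schedulefree

# Toy model and data just to show the call pattern.
model = torch.nn.Linear(10, 2)
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=2.5e-3)

optimizer.train()  # schedule-free optimizers track train/eval mode themselves
for _ in range(10):
    x = torch.randn(8, 10)
    y = torch.randint(0, 2, (8,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

optimizer.eval()  # switch to the averaged weights before evaluation or saving
```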

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline, Pull Request section?
  • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

@muellerzr @younesbelkada @pacman100

winglian avatar Apr 06 '24 03:04 winglian

FYI this will need https://github.com/huggingface/accelerate/pull/2631, since we need to add the ability to call train/eval on a wrapped optimizer upstream in Accelerate.
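For illustration, a rough sketch of why that's needed (the flow shown is an assumption about the intended behavior, not the final implementation): Accelerate wraps the optimizer, so the train()/eval() calls that schedule-free optimizers require have to be forwarded to the inner optimizer.

```python
import torch
import schedulefree
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(10, 2)
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=2.5e-3)

# prepare() wraps the optimizer (AcceleratedOptimizer), so train()/eval()
# must be forwarded to the underlying schedule-free optimizer.
model, optimizer = accelerator.prepare(model, optimizer)

optimizer.train()  # reaches the wrapped optimizer once accelerate#2631 is in
# ... training loop ...
optimizer.eval()   # switch to the averaged weights before evaluation
```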

muellerzr avatar Apr 06 '24 11:04 muellerzr

Some thoughts:

  • I tried asking Aaron et al. on Twitter whether they ran any transformer experiments, but to no avail. They said a paper will come in 1 or 2 months.
  • Aaron et al.'s past work on D-Adaptation won an ICML best-paper award, with Prodigy as the follow-up - but on transformers both did similarly to or worse than AdamW. https://twitter.com/danielhanchen/status/1775547139248341125
  • Superconvergence + the LR range finder + fast.ai's Ranger21 optimizer was the go-to setup for CNNs and worked fabulously well, but on transformers the LR range finder said 1e-3 was best when 1e-5 was actually better. However, the 1-cycle learning rate schedule stuck. https://github.com/huggingface/transformers/issues/16013
  • A huge issue is that this still needs tuning - but how does it compare against a well-tuned AdamW? E.g. see https://twitter.com/kellerjordan0/status/1776716388037529843, which outperformed it using a tuned SGD.

I'm a little reserved for now since the authors themselves aren't providing any transformer benchmarks, nor have they compared their CNN baselines to superconvergence, which is the go-to standard for fast CNN training. Likewise https://parameterfree.com/2023/08/30/yet-another-icml-award-fiasco/ wasn't pleasant.

danielhanchen avatar Apr 07 '24 02:04 danielhanchen

Should be very easy to test this on Phi-2 or TinyLlama when the implementation works?

PhilipMay avatar Apr 07 '24 06:04 PhilipMay

This PR should maybe also add a few lines to the README about "how to use this".
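For example, something along these lines could work as a usage snippet (just a sketch; the `optim` string "schedule_free_adamw" is an assumption about what this PR registers, and `train_dataset` is assumed to be prepared elsewhere):

```python
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

args = TrainingArguments(
    output_dir="out",
    optim="schedule_free_adamw",   # assumed option name; may differ in the merged PR
    lr_scheduler_type="constant",  # the schedule-free optimizer replaces the LR schedule
    learning_rate=2.5e-3,
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)  # your tokenized dataset
trainer.train()
```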

PhilipMay avatar Apr 08 '24 10:04 PhilipMay

We've merged the accelerate portion in, so if anyone is trying this out in a distributed fashion, you can do pip install git+https://github.com/huggingface/accelerate :)

muellerzr avatar Apr 08 '24 15:04 muellerzr

Is there any chance of this making it into the main branch? I and others have confirmed that the results are real. Thank you @winglian

bratao avatar Apr 14 '24 16:04 bratao

Is there any remaining work I could contribute towards getting this PR merged?

Cheers

CoffeeVampir3 avatar May 09 '24 05:05 CoffeeVampir3