
schedulefree optimizers

Open winglian opened this pull request 4 months ago • 12 comments

What does this PR do?

Integrates Meta's schedule-free optimizers (https://github.com/facebookresearch/schedule_free) for AdamW & SGD.

https://twitter.com/aaron_defazio/status/1776320004465582331
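For reference, a minimal sketch of how the upstream schedulefree package is driven (toy model and loop, assuming `pip install schedulefree`); the key difference from plain AdamW is that the optimizer replaces the LR schedule and must be switched between train and eval modes:

```python
import torch
import schedulefree

# Toy model and data just to show the call pattern.
model = torch.nn.Linear(10, 2)
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=2.5e-3)

optimizer.train()  # schedule-free optimizers track train/eval mode themselves
for _ in range(10):
    x = torch.randn(8, 10)
    y = torch.randint(0, 2, (8,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

optimizer.eval()  # switch to the averaged weights before evaluation or saving
```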

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline, Pull Request section?
  • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

@muellerzr @younesbelkada @pacman100

winglian avatar Apr 06 '24 03:04 winglian

FYI this will need https://github.com/huggingface/accelerate/pull/2631, since we need to add the ability to call train/eval on a wrapped optimizer upstream in Accelerate.
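For illustration, a rough sketch of why that's needed (the flow shown is an assumption about the intended behavior, not the final implementation): Accelerate wraps the optimizer, so the train()/eval() calls that schedule-free optimizers require have to be forwarded to the inner optimizer.

```python
import torch
import schedulefree
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(10, 2)
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=2.5e-3)

# prepare() wraps the optimizer (AcceleratedOptimizer), so train()/eval()
# must be forwarded to the underlying schedule-free optimizer.
model, optimizer = accelerator.prepare(model, optimizer)

optimizer.train()  # reaches the wrapped optimizer once accelerate#2631 is in
# ... training loop ...
optimizer.eval()   # switch to the averaged weights before evaluation
```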

muellerzr avatar Apr 06 '24 11:04 muellerzr

Some thoughts:

  • I tried asking Aaron et al. on Twitter whether they ran any transformer experiments, but to no avail. They said a paper will come in 1 or 2 months.
  • Aaron et al.'s past work on D-Adaptation won an ICML best-paper award, with Prodigy as the follow-up - but on transformers both did similarly to or worse than AdamW. https://twitter.com/danielhanchen/status/1775547139248341125
  • Superconvergence + the LR range finder + fast.ai's Ranger21 optimizer was the go-to setup for CNNs and worked fabulously well, but on transformers the LR range finder said 1e-3 was best when 1e-5 was actually better. However, the 1-cycle learning rate schedule stuck. https://github.com/huggingface/transformers/issues/16013
  • A huge issue is that this still needs tuning - but how does it compare against a well-tuned AdamW? E.g. see https://twitter.com/kellerjordan0/status/1776716388037529843, which outperformed it using a tuned SGD.

I'm a little reserved for now since the authors themselves aren't providing any transformer benchmarks, nor have they compared their CNN baselines to superconvergence, which is the go-to standard for fast CNN training. Likewise https://parameterfree.com/2023/08/30/yet-another-icml-award-fiasco/ wasn't pleasant.

danielhanchen avatar Apr 07 '24 02:04 danielhanchen

Should be very easy to test this on Phi-2 or TinyLlama when the implementation works?

PhilipMay avatar Apr 07 '24 06:04 PhilipMay

This PR should maybe also add a few lines to the README about "how to use this".
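For example, something along these lines could work as a usage snippet (just a sketch; the `optim` string "schedule_free_adamw" is an assumption about what this PR registers, and `train_dataset` is assumed to be prepared elsewhere):

```python
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

args = TrainingArguments(
    output_dir="out",
    optim="schedule_free_adamw",   # assumed option name; may differ in the merged PR
    lr_scheduler_type="constant",  # the schedule-free optimizer replaces the LR schedule
    learning_rate=2.5e-3,
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)  # your tokenized dataset
trainer.train()
```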

PhilipMay avatar Apr 08 '24 10:04 PhilipMay

We've merged the accelerate portion in, so if anyone is trying this out in a distributed fashion, you can do pip install git+https://github.com/huggingface/accelerate :)

muellerzr avatar Apr 08 '24 15:04 muellerzr

Is there any chance of this making it into the main branch? I and others have confirmed that the results are real. Thank you @winglian

bratao avatar Apr 14 '24 16:04 bratao

Is there any remaining work I could contribute towards getting this PR merged?

Cheers

CoffeeVampir3 avatar May 09 '24 05:05 CoffeeVampir3