schedulefree optimizers
What does this PR do?
Integrates Meta's https://github.com/facebookresearch/schedule_free for AdamW & SGD
https://twitter.com/aaron_defazio/status/1776320004465582331
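For anyone who hasn't read the repo: schedule-free methods drop the LR decay schedule and instead average iterates online. A pure-Python toy sketch of the SGD variant as I understand it from the paper (names are mine, not the library's actual implementation):

```python
# Toy sketch of the schedule-free SGD update rule (my own naming; the real
# implementation lives in the `schedulefree` package):
#   y_t = (1 - beta) * z_t + beta * x_t   <- gradients are evaluated at y
#   z_{t+1} = z_t - lr * grad(y_t)        <- plain constant-lr SGD step
#   x_{t+1} = (1 - c) * x_t + c * z_{t+1} <- running average, c = 1/(t+1)
def schedule_free_sgd(grad, w0, lr=0.1, beta=0.9, steps=1000):
    z = x = w0  # z: base iterate, x: averaged iterate (used for eval)
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x
        z = z - lr * grad(y)
        c = 1.0 / (t + 1)
        x = (1 - c) * x + c * z
    return x

# Minimize f(w) = (w - 3)^2; the averaged iterate should land near w = 3.
w = schedule_free_sgd(lambda w: 2 * (w - 3), w0=0.0)
```

Note there is no decaying schedule anywhere: the averaging weight `c` plays that role, which is why train/eval hooks are needed to swap between `z`-like and `x`-like parameter copies in the real implementation.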
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
@muellerzr @younesbelkada @pacman100
FYI this will need https://github.com/huggingface/accelerate/pull/2631 as we need to upstream accelerate's ability to call train/eval on a wrapped optimizer
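To illustrate why that accelerate change is needed: schedule-free optimizers expose `train()`/`eval()` methods that swap parameter copies, and the optimizer wrapper has to forward those calls to the inner optimizer. A toy sketch of the delegation pattern with made-up class names (the real wrapper is accelerate's `AcceleratedOptimizer`):

```python
class ToyOptimizerWrapper:
    """Hypothetical stand-in for accelerate's optimizer wrapper."""

    def __init__(self, optimizer):
        self.optimizer = optimizer

    def step(self):
        self.optimizer.step()

    def train(self):
        # Schedule-free optimizers need train()/eval() called before
        # training/evaluation to swap parameter copies; forward if present.
        if hasattr(self.optimizer, "train"):
            self.optimizer.train()

    def eval(self):
        if hasattr(self.optimizer, "eval"):
            self.optimizer.eval()


class ToyScheduleFreeOptim:
    """Minimal inner optimizer that just tracks its train/eval mode."""

    def __init__(self):
        self.mode = "eval"

    def step(self):
        pass

    def train(self):
        self.mode = "train"

    def eval(self):
        self.mode = "eval"


inner = ToyScheduleFreeOptim()
wrapped = ToyOptimizerWrapper(inner)
wrapped.train()   # the trainer calls this before the training loop
mode_during_training = inner.mode
wrapped.eval()    # and this before evaluation
mode_during_eval = inner.mode
```

Without the forwarding, the wrapper would silently swallow those calls and the schedule-free optimizer would evaluate with the wrong parameter copy.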
Some thoughts:
- I tried asking Aaron et al. on Twitter whether they ran any transformer experiments, but to no avail. They said a paper will come in 1 or 2 months.
- Aaron et al.'s past work on D-Adaptation won an ICML best paper award, with their follow-up work being Prodigy, but on transformers both performed similarly to or worse than AdamW. https://twitter.com/danielhanchen/status/1775547139248341125
- Superconvergence + LR range finder + fast.ai's Ranger21 optimizer was the go-to combination for CNNs and worked fabulously well, but on transformers the learning rate range finder said 1e-3 was best when 1e-5 actually performed better. The 1-cycle learning rate schedule did stick around, however. https://github.com/huggingface/transformers/issues/16013
- A huge issue is that this needs tuning?! But how about a well-tuned AdamW? E.g. see https://twitter.com/kellerjordan0/status/1776716388037529843, which outperformed it using a tuned SGD.
I'm just a little reserved for now since the authors themselves aren't providing any transformer benchmarks, nor have they compared their CNN baselines to superconvergence, which is the go-to standard for fast CNN training. Likewise, https://parameterfree.com/2023/08/30/yet-another-icml-award-fiasco/ wasn't pleasant.
Should be very easy to test this on Phi-2 or TinyLlama when the implementation works?
This PR should maybe also add a few lines to the README about "how to use this".
We've merged the accelerate portion in, so if anyone is trying this out in distributed fashions, you can do pip install git+https://github.com/huggingface/accelerate :)
Is there any chance of this making it into the main branch? I and others confirmed that the results are real. Thank you @winglian
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Is there any remaining work I could contribute towards getting this PR merged?
Cheers
@pacman100 @muellerzr @younesbelkada Can we get a new review to get this merged? Since the last check, I rebased, added some fixes and docs.
@muellerzr ran make quality/lint and also added a smoke test to the test suite for schedule-free AdamW
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Will get back to this soon. Not stale 😅
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@winglian please don't let it die
thanks for the fixes @tmm1 !
@amyeroberts I addressed your comments. LMK what else is required to push this through!
@winglian It looks like we'll need a bit more iteration on the eval and train method checks as it's causing failing tests atm. For the code quality tests, could you run make fixup and push the changes?
> we'll need a bit more iteration on the `eval` and `train` method checks as it's causing failing tests atm

looks like we need to revisit https://github.com/huggingface/accelerate/pull/2631 which i've done in https://github.com/huggingface/accelerate/pull/3055

> could you run `make fixup` and push the changes
:+1:
lgtm!
for what it's worth, there was some asking when this PR was originally opened about whether schedulefree's optims were effective on transformers; i did some testing a while ago, and SF AdamW had a (albeit marginally) better loss/accuracy landscape than normal AdamW on at least sequence classification finetuning, even after tuning hyperparams.
(purple line is ScheduleFreeAdamW, red line is just AdamW)
@amyeroberts can we kick off the tests again? does the build pick up the latest accelerate release automatically?
Thanks for the PR @winglian!
all the credit goes to @tmm1 for getting this over the line!
If I use this, can I ignore the lr that gets printed out by trainer? (looks like the default linear decay) Or do I need to change the trainer's lr scheduler to constant?
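Not authoritative, but my understanding is that schedule-free replaces the decay schedule itself, so it should be paired with a constant scheduler, and the printed lr just reflects whichever scheduler is configured. A hedged, untested config sketch, assuming `"schedule_free_adamw"` is the optim value this PR adds:

```python
from transformers import TrainingArguments

# Untested config sketch: pairing the schedule-free optimizer with a
# constant scheduler, since the optimizer handles the "schedule" internally.
# "schedule_free_adamw" is assumed to be the optim name this PR introduces.
args = TrainingArguments(
    output_dir="out",
    optim="schedule_free_adamw",
    lr_scheduler_type="constant",
    learning_rate=2e-5,
)
```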