
schedulefree optimizers

Open winglian opened this issue 1 year ago • 18 comments

What does this PR do?

Integrates Meta's https://github.com/facebookresearch/schedule_free optimizers for AdamW & SGD.

https://twitter.com/aaron_defazio/status/1776320004465582331

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline, Pull Request section?
  • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

@muellerzr @younesbelkada @pacman100

winglian avatar Apr 06 '24 03:04 winglian
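For readers skimming the thread, the core schedule-free update can be sketched on a toy 1-D quadratic. The variable names (z, x, y, beta) follow the schedule_free repo's description; the objective, step size, and iteration count below are purely illustrative, not the library's actual implementation:

```python
# Sketch of the schedule-free SGD update on f(x) = (x - 3)^2.
# z is the fast SGD iterate, x is the running average returned at eval
# time, and y is the interpolated point where gradients are evaluated.

def grad(y):
    return 2.0 * (y - 3.0)  # gradient of (y - 3)^2

beta = 0.9  # interpolation factor (momentum-like)
lr = 0.1
z = 0.0     # fast iterate ("train mode" state)
x = 0.0     # averaged weights ("eval mode" state)

for t in range(1, 101):
    y = (1 - beta) * z + beta * x  # gradients are taken at y, not at x or z
    z = z - lr * grad(y)           # plain SGD step on the fast iterate
    c = 1.0 / t                    # uniform averaging weight
    x = (1 - c) * x + c * z        # x drifts toward the average of the z's

# x approaches the minimizer 3.0 without any learning-rate schedule
```

The x/z split is why these optimizers expose train/eval modes: evaluation and checkpointing should see the averaged x, while training steps operate on z.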

FYI this will need https://github.com/huggingface/accelerate/pull/2631 as we need to upstream accelerate's ability to call train/eval on a wrapped optimizer

muellerzr avatar Apr 06 '24 11:04 muellerzr
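The accelerate dependency exists because schedule-free optimizers expose `.train()`/`.eval()` to swap between the fast iterate and the averaged weights, and any wrapper class must forward those calls or they silently do nothing. A minimal sketch of that forwarding pattern; both class names here are illustrative stand-ins, not accelerate's real code:

```python
class ScheduleFreeLike:
    """Stand-in for a schedule-free optimizer with mode switching."""
    def __init__(self):
        self.mode = "train"

    def train(self):
        self.mode = "train"  # subsequent steps use the fast iterate

    def eval(self):
        self.mode = "eval"   # parameters swapped to the averaged weights

class OptimizerWrapper:
    """Stand-in for a wrapper (e.g. accelerate's) around a user optimizer."""
    def __init__(self, inner):
        self.inner = inner

    def train(self):
        # Only schedule-free optimizers have these methods, so guard the call.
        if hasattr(self.inner, "train"):
            self.inner.train()

    def eval(self):
        if hasattr(self.inner, "eval"):
            self.inner.eval()

sf = ScheduleFreeLike()
opt = OptimizerWrapper(sf)
opt.eval()   # before evaluation/checkpointing: switch to averaged weights
opt.train()  # before resuming training: switch back to the fast iterate
```

Without the forwarding, the Trainer would evaluate and save the un-averaged fast iterate, which is exactly the failure mode the linked accelerate PR fixes.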

Some thoughts:

  • I was trying to ask Aaron et al on Twitter if they did any transformer experiments, but to no avail. They said a paper will come in 1 or 2 months.
  • Aaron et al's past work on D-Adaptation won a best ICML paper, with their follow-up work being Prodigy - but on transformers both did similar to or worse than AdamW. https://twitter.com/danielhanchen/status/1775547139248341125
  • Superconvergence + the LR range finder + fast.ai's Ranger21 optimizer was the go-to recipe for CNNs and worked fabulously well, but on transformers the learning rate range finder said 1e-3 was best when 1e-5 was actually better. The 1-cycle learning rate schedule stuck, however. https://github.com/huggingface/transformers/issues/16013
  • A huge issue is that this needs tuning?! But then how about a well-tuned AdamW? E.g. see https://twitter.com/kellerjordan0/status/1776716388037529843, which outperformed it using a tuned SGD.

I'm just a little bit reserved for now, since the authors themselves aren't providing any transformer benchmarks, nor have they compared their CNN baselines to superconvergence, which is the go-to standard for fast CNN training. Likewise, https://parameterfree.com/2023/08/30/yet-another-icml-award-fiasco/ wasn't pleasant.

danielhanchen avatar Apr 07 '24 02:04 danielhanchen

Should be very easy to test this on Phi-2 or TinyLlama when the implementation works?

PhilipMay avatar Apr 07 '24 06:04 PhilipMay

This PR should maybe also add a few lines to the README about "how to use this".

PhilipMay avatar Apr 08 '24 10:04 PhilipMay
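Along the lines PhilipMay suggests, end-user usage would presumably be a one-line `optim` switch in `TrainingArguments`. The `optim` string values and the constant-schedule pairing below are assumptions based on the optimizer names discussed in this PR, not confirmed final API:

```python
from transformers import TrainingArguments

# Hypothetical config sketch: the "schedule_free_adamw" value is assumed
# from this PR's naming; "schedule_free_sgd" would be the SGD variant.
args = TrainingArguments(
    output_dir="out",
    optim="schedule_free_adamw",
    lr_scheduler_type="constant",  # schedule-free replaces the LR schedule
    learning_rate=2e-3,
)
```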

We've merged the accelerate portion in, so if anyone is trying this out in distributed fashions, you can do pip install git+https://github.com/huggingface/accelerate :)

muellerzr avatar Apr 08 '24 15:04 muellerzr

Is there any chance of this making it into the main branch? I and others confirmed that the results are real. Thank you @winglian

bratao avatar Apr 14 '24 16:04 bratao

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Is there any remaining work I could contribute towards getting this PR merged?

Cheers

CoffeeVampir3 avatar May 09 '24 05:05 CoffeeVampir3

@pacman100 @muellerzr @younesbelkada Can we get a new review to get this merged? Since the last check, I rebased, added some fixes and docs.

winglian avatar May 31 '24 18:05 winglian

@muellerzr ran make quality/lint and also added a smoke test to the test suite for schedule-free AdamW

winglian avatar Jun 01 '24 01:06 winglian

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jun 28 '24 08:06 github-actions[bot]

Will get back to this soon. Not stale 😅

winglian avatar Jun 28 '24 11:06 winglian

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 23 '24 08:07 github-actions[bot]

@winglian please don't let it die

bratao avatar Jul 31 '24 11:07 bratao

thanks for the fixes @tmm1 !

winglian avatar Aug 20 '24 23:08 winglian

@amyeroberts I addressed your comments. LMK what else is required to push this through!

tmm1 avatar Aug 20 '24 23:08 tmm1

@winglian It looks like we'll need a bit more iteration on the eval and train method checks as it's causing failing tests atm. For the code quality tests, could you run make fixup and push the changes?

amyeroberts avatar Aug 27 '24 16:08 amyeroberts

we'll need a bit more iteration on the eval and train method checks as it's causing failing tests atm

looks like we need to revisit https://github.com/huggingface/accelerate/pull/2631 which i've done in https://github.com/huggingface/accelerate/pull/3055

could you run make fixup and push the changes

:+1:

tmm1 avatar Aug 27 '24 16:08 tmm1

lgtm! For what it's worth, there was some asking when this PR was originally opened about whether schedule-free's optims were effective on transformers; I did some testing a while ago, and SF AdamW had an (albeit marginally) better loss/acc landscape than normal AdamW on at least sequence-classification finetuning, even after tuning hyperparams. [Screenshot: training curves; purple line is ScheduleFreeAdamW, red line is plain AdamW]

fizzAI avatar Aug 30 '24 12:08 fizzAI

@amyeroberts can we kick off the tests again? does the build pick up the latest accelerate release automatically?

tmm1 avatar Sep 06 '24 19:09 tmm1

Thanks for the PR @winglian!

all the credit goes to @tmm1 for getting this over the line!

winglian avatar Sep 11 '24 14:09 winglian

If I use this, can I ignore the lr that gets printed out by the Trainer? (It looks like the default linear decay.) Or do I need to change the Trainer's lr scheduler to constant?

llllvvuu avatar Sep 17 '24 13:09 llllvvuu