Mitchell Wortsman
Plot for B/32 
Thanks @xiangning-chen, will try the other betas, and potentially a higher LR when making that change! When raising the LR, would you also raise WD?
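For context on why LR and WD would move together: the Lion paper's rule of thumb (not stated in this thread) is to use a 3-10x smaller LR than AdamW and scale WD up so the product lr × wd, which sets the per-step decoupled-decay strength, stays roughly constant. A minimal sketch with a hypothetical helper name:

```python
def scale_hparams_for_lion(adamw_lr, adamw_wd, lr_shrink=10.0):
    """Hypothetical helper: shrink LR by `lr_shrink` and raise WD by the
    same factor, preserving lr * wd (the effective decay strength)."""
    return adamw_lr / lr_shrink, adamw_wd * lr_shrink

# e.g. an AdamW baseline of lr=1e-3, wd=0.2 maps to Lion lr=1e-4, wd=2.0
lion_lr, lion_wd = scale_hparams_for_lion(1e-3, 0.2)
```

The exact shrink factor is a tuning knob; the point is only that raising LR without lowering WD changes the effective decay strength as well.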
Thanks, and congrats on the work by the way. Really cool results and quite an interesting optimizer you found!
Ran short (20k iterations) for batch size 16k and H/14 on LAION-2B. Not as much room for hparam tuning since the experiments are compute intensive, so still finding Lion...
@xiangning-chen thanks for the recommendations! Yes, 20k is extremely short, but sadly these experiments are already very expensive, so we don't have many other options. Hmm, so you'd say just re-run...
Yep! 5k warmup iterations (linear warmup), then cosine decay. And the weight decay for the AdamW baseline is 0.2.
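The schedule described above, as a self-contained sketch (function name and the peak-LR/total-step arguments are my own; 5k warmup is from the comment):

```python
import math

def lr_at(step, peak_lr, warmup_steps=5000, total_steps=20000):
    """Linear warmup to peak_lr over warmup_steps, then cosine decay to 0
    by total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

This is the schedule shape only; in practice it would be wrapped in whatever scheduler API the training framework uses.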
Very interesting, thanks for sharing! A few comments/questions:
- For AdamW, do you think performance could improve by, e.g., moving away from the default beta2 to 0.98 or 0.95?
- For...
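Why beta2 matters here: it sets the averaging window of Adam's second moment, roughly 1/(1 - beta2) steps (about 1000 at 0.999, 50 at 0.98, 20 at 0.95), so a lower beta2 reacts faster to shifts in gradient scale. A toy sketch of the bias-corrected second-moment estimate (function name is my own):

```python
def second_moment(grads, beta2):
    """Bias-corrected EWMA of squared gradients, as in Adam's v_t."""
    v = 0.0
    t = 0
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1.0 - beta2) * g * g
    return v / (1.0 - beta2 ** t)

# After a late gradient spike, beta2=0.95 moves v much more than 0.999,
# i.e. the smaller beta2 adapts faster to changes in gradient scale.
```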
This is super interesting @xiangning-chen, thanks a lot for the exhaustive exploration! I would not have thought of modifying the temperature; how did you think of this? I am really looking...
Potentially, but I'm not totally sure. I think a test would be useful here, i.e., with and without the scaling, compared to the non-accum baseline.
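The kind of check I mean, as a toy sketch (names hypothetical): with equal-sized microbatches, accumulated gradients that are averaged over the accumulation steps should reproduce the non-accum large-batch gradient exactly, while the unscaled sum should not.

```python
def mean_grad(batch):
    # Toy "gradient": the mean over examples, standing in for the
    # gradient of a mean loss over the batch.
    return sum(batch) / len(batch)

def accum_grad(microbatches, scale=True):
    """Accumulate per-microbatch gradients; with `scale`, average them so
    the result matches the single large-batch gradient."""
    g = sum(mean_grad(mb) for mb in microbatches)
    return g / len(microbatches) if scale else g

data = [1.0, 2.0, 3.0, 4.0]
scaled = accum_grad([data[:2], data[2:]])          # matches non-accum baseline
unscaled = accum_grad([data[:2], data[2:]], scale=False)
```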
Thanks for the comments, updated! The default is syncing via `aws s3 sync`.