Mitchell Wortsman
Plot for B/32 
Thanks @xiangning-chen, will try the other betas, and potentially a higher LR when making that change! When raising the LR, would you also raise WD?
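For context on why LR and WD would move together: the Lion paper's rule of thumb (not stated in this thread) is to use a 3-10x smaller LR than AdamW and scale WD up so the product lr × wd, which sets the per-step decoupled-decay strength, stays roughly constant. A minimal sketch with a hypothetical helper name:

```python
def scale_hparams_for_lion(adamw_lr, adamw_wd, lr_shrink=10.0):
    """Hypothetical helper: shrink LR by `lr_shrink` and raise WD by the
    same factor, preserving lr * wd (the effective decay strength)."""
    return adamw_lr / lr_shrink, adamw_wd * lr_shrink

# e.g. an AdamW baseline of lr=1e-3, wd=0.2 maps to Lion lr=1e-4, wd=2.0
lion_lr, lion_wd = scale_hparams_for_lion(1e-3, 0.2)
```

The exact shrink factor is a tuning knob; the point is only that raising LR without lowering WD changes the effective decay strength as well.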
Thanks, and congrats on the work by the way. Really cool results and quite an interesting optimizer you found!
Ran short (20k iterations) for batch size 16k and H/14 on LAION-2B. Not as much room for hparam tuning since the experiments are compute intensive, so still finding Lion...
@xiangning-chen thanks for the recommendations! Yes, 20k is extremely short, but sadly these experiments are already very expensive, so we don't have many other options. Hmm, so you'd say just re-run...
Yep! 5k warmup iterations (linear warmup), then cosine decay. And the weight decay for the AdamW baseline is 0.2.
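The schedule described above, as a self-contained sketch (function name and the peak-LR/total-step arguments are my own; 5k warmup is from the comment):

```python
import math

def lr_at(step, peak_lr, warmup_steps=5000, total_steps=20000):
    """Linear warmup to peak_lr over warmup_steps, then cosine decay to 0
    by total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

This is the schedule shape only; in practice it would be wrapped in whatever scheduler API the training framework uses.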
Very interesting, thanks for sharing! A few comments/questions:
- For AdamW, do you think performance could improve by, e.g., moving away from the default beta2 to 0.98 or 0.95?
- For...
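Why beta2 matters here: it sets the averaging window of Adam's second moment, roughly 1/(1 - beta2) steps (about 1000 at 0.999, 50 at 0.98, 20 at 0.95), so a lower beta2 reacts faster to shifts in gradient scale. A toy sketch of the bias-corrected second-moment estimate (function name is my own):

```python
def second_moment(grads, beta2):
    """Bias-corrected EWMA of squared gradients, as in Adam's v_t."""
    v = 0.0
    t = 0
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1.0 - beta2) * g * g
    return v / (1.0 - beta2 ** t)

# After a late gradient spike, beta2=0.95 moves v much more than 0.999,
# i.e. the smaller beta2 adapts faster to changes in gradient scale.
```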
This is super interesting @xiangning-chen, thanks a lot for the exhaustive exploration! I would not have thought of modifying the temperature; how did you think of this? I am really looking...
Potentially, but I'm not totally sure. I think a test would be useful here, i.e., with and without the scaling, compared to the non-accum baseline.
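The kind of check I mean, as a toy sketch (names hypothetical): with equal-sized microbatches, accumulated gradients that are averaged over the accumulation steps should reproduce the non-accum large-batch gradient exactly, while the unscaled sum should not.

```python
def mean_grad(batch):
    # Toy "gradient": the mean over examples, standing in for the
    # gradient of a mean loss over the batch.
    return sum(batch) / len(batch)

def accum_grad(microbatches, scale=True):
    """Accumulate per-microbatch gradients; with `scale`, average them so
    the result matches the single large-batch gradient."""
    g = sum(mean_grad(mb) for mb in microbatches)
    return g / len(microbatches) if scale else g

data = [1.0, 2.0, 3.0, 4.0]
scaled = accum_grad([data[:2], data[2:]])          # matches non-accum baseline
unscaled = accum_grad([data[:2], data[2:]], scale=False)
```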
Thanks for the comments, updated! The default is syncing via `aws s3 sync`.