Phil Wang
> So far at small scale (short B/32 run, batch size 16k), well tuned lion slightly outperforms AdamW (still tuning AdamW).
>
> AdamW (LR 2e-3, WD 0.2, betas=0.9, 0.95)...
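for anyone following along, a minimal sketch of the sign-based Lion update being compared against AdamW above — the function name and default hyperparameters here are only illustrative, not the exact setup from that run

```python
# rough sketch of the Lion update rule (sign of interpolated momentum + decoupled weight decay)
import torch

@torch.no_grad()
def lion_update(param, grad, exp_avg, lr=1e-4, wd=0.1, beta1=0.9, beta2=0.99):
    # interpolate momentum and current gradient, then take only the sign for the update
    update = (exp_avg * beta1 + grad * (1 - beta1)).sign_()
    # decoupled weight decay, as in AdamW
    param.mul_(1 - lr * wd)
    param.add_(update, alpha=-lr)
    # update the exponential moving average with the second beta
    exp_avg.mul_(beta2).add_(grad, alpha=1 - beta2)
```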
@mitchellnw wanted to thank you for running and sharing this btw! honestly, i was on the fence about this technique, but now i believe it should be used in the...
@xiangning-chen 👋 just heard another positive result this morning from someone trustworthy! 💯 while you are here, have you figured out which learning rate scheduler is optimal with Lion? it...
@xiangning-chen thank you for your recommendation!
yea, i'm starting to hear more negative reports coming in unfortunately. the common story i hear is that it converges faster, but generalizes worse
@xiangning-chen oh this is really interesting! what initial temperature value did the contrastive learning networks (LiT and BASIC) you tested on have?
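for context on where that temperature sits, a minimal sketch of a CLIP-style contrastive loss with a learnable temperature — the log(1/0.07) initialization is CLIP's choice and only illustrative here; LiT and BASIC may initialize it differently

```python
import math
import torch
import torch.nn.functional as F

# learnable log of the inverse temperature, initialized as in CLIP
logit_scale = torch.nn.Parameter(torch.tensor(math.log(1 / 0.07)))

def contrastive_loss(image_embed, text_embed):
    # cosine similarities scaled by the exponentiated learnable temperature
    image_embed = F.normalize(image_embed, dim=-1)
    text_embed = F.normalize(text_embed, dim=-1)
    logits = logit_scale.exp() * image_embed @ text_embed.t()
    labels = torch.arange(logits.shape[0], device=logits.device)
    # symmetric cross entropy over image->text and text->image directions
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```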
@iejMac nice! i can contribute to this i believe. for video, we can do much more aggressive patch dropout in the beginning. well, if the video does not resemble [this](https://www.youtube.com/watch?v=a2v7JK8c2fk)...
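roughly what i mean by "much more aggressive" — a hypothetical sketch of patch dropout over a sequence of video patch tokens; the keep ratio of 0.25 and the (batch, num_patches, dim) layout are just assumptions for illustration

```python
import torch

def patch_dropout(tokens, keep_ratio=0.25):
    # tokens: (batch, num_patches, dim) -- keep a random subset of patches per sample
    batch, num_patches, _ = tokens.shape
    num_keep = max(1, int(num_patches * keep_ratio))
    # assign random scores per patch and keep the indices of the top `num_keep`
    scores = torch.rand(batch, num_patches, device=tokens.device)
    keep_indices = scores.topk(num_keep, dim=-1).indices
    batch_indices = torch.arange(batch, device=tokens.device).unsqueeze(-1)
    return tokens[batch_indices, keep_indices]
```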

@iejMac nice! i'll do a code review later this week when i find some downtime
@rwightman sure, by modality, or by functionality, or both, either way is fine, just let me know