Xiangning Chen
Hey, has the sample_weight problem been solved now? Thanks
@mitchellnw Thanks for the experiments! I noticed that you used betas=(0.9, 0.95) for AdamW compared to the default betas=(0.9, 0.999), while for Lion it's still the default betas=(0.9, 0.99),...
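For concreteness, a minimal sketch of the two settings being compared here, assuming `torch.optim.AdamW` and the `lion_pytorch` package; the model and the learning rate / weight decay values below are placeholders, not the ones from the experiments:

```python
import torch
from lion_pytorch import Lion  # https://github.com/lucidrains/lion-pytorch

model = torch.nn.Linear(768, 768)  # placeholder module

# AdamW with the non-default beta2 = 0.95 mentioned above
adamw = torch.optim.AdamW(
    model.parameters(), lr=1e-3, betas=(0.9, 0.95), weight_decay=0.1
)

# Lion left at its default betas = (0.9, 0.99)
lion = Lion(
    model.parameters(), lr=1e-4, betas=(0.9, 0.99), weight_decay=1.0
)
```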
@lucidrains Thanks for sharing the good news! We always used the same learning rate schedule as AdamW in our experiments, including cosine decay, linear decay, and constant (all with 10K...
@mitchellnw Thank you! Actually, I would decrease the WD when raising the LR to maintain the effective weight decay strength LR*WD.
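As a rough sketch of that coupling (the baseline numbers here are illustrative, not values from the thread): keep the product LR*WD fixed when rescaling the learning rate.

```python
# Illustrative only: preserve the effective decay strength lr * wd.
base_lr, base_wd = 1e-4, 1.0           # hypothetical baseline
target_effective = base_lr * base_wd   # lr * wd to keep constant

new_lr = 3e-4                          # e.g., raising the LR by 3x
new_wd = target_effective / new_lr     # so the WD shrinks by the same factor

print(new_lr, new_wd)  # 0.0003 0.333...
```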
Thanks for the experiments! May I know the learning rate schedule and warmup iterations? @mitchellnw Also, the wd is still 0.2 for AdamW, right? What about lr=4e-4, betas=(0.95, 0.98), wd=1.0...
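If it helps, the suggested setting would look like this with the `lion_pytorch` implementation (the model below is a stand-in):

```python
import torch
from lion_pytorch import Lion

model = torch.nn.Linear(768, 768)  # stand-in module
optimizer = Lion(
    model.parameters(), lr=4e-4, betas=(0.95, 0.98), weight_decay=1.0
)
```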
@lucidrains May I know on what domains they observe faster convergence but worse generalization? Thanks! To me, this appears to be a consequence of using a small learning rate,...
@mitchellnw May I know the warmup iterations and learning rate schedule? Thanks!
@mitchellnw Thanks for the information! I quickly tested training ViT-B/16 with 20K steps and batch size 4,096. I will try CLIP training as well. Here are the...
> For AdamW do you think performance could improve by, e.g., moving away from default beta2 to 0.98 or 0.95. If using the same learning rate, I don't think this...
@mitchellnw I have some updates to share! I discovered that the ***initial temperature value*** has an impact, and tuning it has resulted in better performance for Lion compared to AdamW...
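A hedged sketch of what that knob could look like, assuming the temperature refers to the learnable logit scale used in CLIP-style contrastive training; the 0.07 default below is the common initialization, not the tuned value from these experiments:

```python
import numpy as np
import torch

# CLIP-style learnable temperature, stored as a log-scale parameter.
init_temperature = 0.07  # common default; the tuned value is not shown here
logit_scale = torch.nn.Parameter(torch.ones([]) * np.log(1.0 / init_temperature))

# At loss time, cosine similarities are multiplied by the (clamped) exponentiated scale.
image_features = torch.randn(8, 512)
text_features = torch.randn(8, 512)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
logits = logit_scale.exp().clamp(max=100.0) * image_features @ text_features.t()
```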