Warming up the optimizer states with learning rate = 0 for a few steps
We could consider using this trick for finetuning, as it is quite inexpensive. Intuitively, it makes sense to me.
https://x.com/StasBekman/status/1762197664454848693?s=20
When finetuning a model, has anyone experimented with first running with LR=0 for some 100-1000 iterations to get the optimizer states tuned up, and only then restarting with a normal LR scheduler?
I'm thinking this would be more efficient: when the optimizer states are fresh, even a tiny LR will initially mess up the pretrained weights, and the optimizer then needs time to correct itself. Starting the weight updates with good optimizer states should save time, despite the initial non-stepping steps. It would also likely allow a more aggressive LR schedule that doesn't need a long warmup.
cc @rasbt
This sounds interesting, but I would say let's not make it the default, because it would then become difficult to compare against other LLM frameworks. I do like the current warmup/decay we have implemented, which also matches what others are doing (like Llama and OLMo, except that OLMo uses a linear instead of a cosine decay).
That said, this could potentially be offered as an additional option.
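For illustration, here is a minimal PyTorch sketch of what such an option could look like; the function name `warmup_optimizer_states` and the `model`/`dataloader`/`loss_fn` arguments are placeholders, not existing code in this repo:

```python
import torch


def warmup_optimizer_states(model, optimizer, dataloader, loss_fn, num_steps=100):
    """Run a few steps with LR=0 so AdamW's moment estimates (exp_avg, exp_avg_sq)
    are populated before any real weight updates happen.

    With lr=0, optimizer.step() still updates the moments but leaves the weights
    unchanged (the decoupled weight-decay term in AdamW is also scaled by lr).
    """
    # Remember the configured learning rates, then zero them out.
    original_lrs = [group["lr"] for group in optimizer.param_groups]
    for group in optimizer.param_groups:
        group["lr"] = 0.0

    data_iter = iter(dataloader)
    for _ in range(num_steps):
        inputs, targets = next(data_iter)
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()       # accumulates moments; no weight change at lr=0
        optimizer.zero_grad()

    # Restore the learning rates before the normal warmup/decay schedule starts.
    for group, lr in zip(optimizer.param_groups, original_lrs):
        group["lr"] = lr
```

After this, the regular finetuning loop (with the usual LR scheduler) would start from well-initialized optimizer states instead of zeros.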