Warming up the optimizer states with learning rate = 0 for a few steps
We could consider using this trick for finetuning, as it is quite inexpensive. Intuitively, it makes sense to me.
https://x.com/StasBekman/status/1762197664454848693?s=20
When finetuning a model, has anyone experimented with first running with LR=0 for some 100-1000 iterations to get the optimizer states tuned up, and only then restarting with a normal LR scheduler?
I'm thinking this would be more efficient: when the optimizer states are fresh, even a tiny LR will initially mess up the pretrained weights, and the optimizer then needs time to correct itself. Starting the weight updates with good optimizer states should save time, despite the initial non-stepping steps. It would also likely allow a more aggressive LR schedule that doesn't need a long warmup.
cc @rasbt
This sounds interesting, but I would say let's not make it the default, because it would then become difficult to compare against other LLM frameworks. I do like the current warmup/decay we have implemented, which also matches what others are doing (like Llama and OLMo, except that OLMo uses a linear instead of a cosine decay).
That said, this could potentially be offered as an additional option.
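For illustration, here is a minimal PyTorch sketch of what such an option could look like; the function name `warmup_optimizer_states` and the `model`/`dataloader`/`loss_fn` arguments are placeholders, not existing code in this repo:

```python
import torch


def warmup_optimizer_states(model, optimizer, dataloader, loss_fn, num_steps=100):
    """Run a few steps with LR=0 so AdamW's moment estimates (exp_avg, exp_avg_sq)
    are populated before any real weight updates happen.

    With lr=0, optimizer.step() still updates the moments but leaves the weights
    unchanged (the decoupled weight-decay term in AdamW is also scaled by lr).
    """
    # Remember the configured learning rates, then zero them out.
    original_lrs = [group["lr"] for group in optimizer.param_groups]
    for group in optimizer.param_groups:
        group["lr"] = 0.0

    data_iter = iter(dataloader)
    for _ in range(num_steps):
        inputs, targets = next(data_iter)
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()       # accumulates moments; no weight change at lr=0
        optimizer.zero_grad()

    # Restore the learning rates before the normal warmup/decay schedule starts.
    for group, lr in zip(optimizer.param_groups, original_lrs):
        group["lr"] = lr
```

After this, the regular finetuning loop (with the usual LR scheduler) would start from well-initialized optimizer states instead of zeros.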