LOMO
Is LOMO capable of pre-training an LLM from scratch as well?
Good question. We don't know yet how LOMO will perform in the pre-training stage. The major concern is that SGD is sensitive to the optimization settings. My guess is that the optimization process of pre-training from scratch is more difficult than that of fine-tuning or further pre-training.
In practice, a possible compromise is to use a powerful optimizer (e.g., Adam) for a warm-up phase and then switch to a cheaper optimizer (e.g., LOMO), as sketched below.
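Here is a minimal sketch of that two-phase idea in PyTorch, assuming a standard training loop. Plain `torch.optim.SGD` stands in for the cheaper optimizer, since LOMO's own interface is not shown here; the model, step counts, and loss are all dummy placeholders.

```python
import torch
from torch import nn

# Illustrative setup; model, WARMUP_STEPS, and the loss are placeholders.
model = nn.Linear(512, 512)
WARMUP_STEPS = 1_000
TOTAL_STEPS = 10_000

# Phase 1: a powerful optimizer (AdamW) for the warm-up.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(TOTAL_STEPS):
    if step == WARMUP_STEPS:
        # Phase 2: hand off to a cheaper optimizer once warm-up is done.
        # Adam's moment buffers are simply discarded at the switch.
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    inputs = torch.randn(8, 512)        # dummy batch
    loss = model(inputs).pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Note that the real LOMO fuses the parameter update into the backward pass to save memory, so an actual switch to LOMO would change the loop structure rather than just the `optimizer` object; the sketch above only illustrates the warm-up-then-switch schedule.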
I am doing further pre-training now and will reply later when the results come out.