question regarding training stability
I have a question about training stability. I downloaded the full RedPajama v1 dataset from Hugging Face and followed the LLaMA-1 paper's settings for the data mixture and training hyperparameters. I trained two model sizes, 1.8B and 7B. Unfortunately, the 7B model's loss began to rise after 300 billion tokens, and the 1.8B model showed a similar increase after 250 billion tokens. How can I address this training instability?
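
For reference, here is a minimal sketch of the 7B hyperparameters I am following, taken from the LLaMA-1 paper (the `TrainConfig` structure and field names below are just illustrative; my actual config may differ slightly in how these are expressed):

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    """Sketch of LLaMA-1-style settings for the 7B run (per the paper)."""
    model_size: str = "7B"
    peak_lr: float = 3.0e-4            # smaller models use a higher peak LR in the paper
    lr_schedule: str = "cosine"        # cosine decay to 10% of the peak learning rate
    warmup_steps: int = 2000
    adam_beta1: float = 0.9
    adam_beta2: float = 0.95
    weight_decay: float = 0.1
    grad_clip_norm: float = 1.0
    tokens_per_batch: int = 4_000_000  # roughly 4M tokens per global batch


config = TrainConfig()
print(config)
```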