DeepSeek-Coder

Training loss extremely noisy during fine-tuning and randomly goes to 0

Open zpx01 opened this issue 1 year ago • 2 comments

I'm trying to fine-tune the 6.7B model on my own code dataset. I'm running multi-node training in fp32 precision on NVIDIA Tesla V100 GPUs with DeepSpeed ZeRO Stage 3. My training loss fluctuates wildly and randomly drops to zero; I've attached my training loss graph below:

[Attached image: training loss graph (screenshot, Jan 25, 2024)]

I'm running this on 128 GPUs with a train batch size of 1 per device and no gradient accumulation. I'm not sure what could be causing this, as I haven't seen it happen with other Llama-architecture models. Would appreciate any general direction to help debug this, thanks!
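For reference, here is a minimal sketch of a DeepSpeed ZeRO Stage 3 configuration matching the setup described above (fp32, per-device batch size 1, no gradient accumulation); the field names follow the DeepSpeed config schema, and any values not stated above are illustrative assumptions rather than my exact settings:

```python
# Sketch of a DeepSpeed config dict for the setup described in this issue.
# Values such as overlap_comm and gradient_clipping are assumptions for
# illustration, not necessarily what I am running.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # batch size 1 per device
    "gradient_accumulation_steps": 1,      # no gradient accumulation
    "zero_optimization": {
        "stage": 3,                        # ZeRO Stage 3 (full parameter sharding)
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "fp16": {"enabled": False},            # training in full fp32
    "bf16": {"enabled": False},            # V100s do not support bf16 anyway
    "gradient_clipping": 1.0,
}
```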

zpx01 · Jan 26 '24 06:01

@DejianYang @pkuzqh Would appreciate any help on this ticket, thanks

zpx01 · Feb 06 '24 04:02