A potential bug in multi-GPU training
Hi,
I noticed the following strange behavior when running the TinyLlama pretraining.
- When using multiple GPUs, I get completely different results when running the same code twice, and many loss spikes occur. See the example below for 2-card training. I use all the default settings except that I shrink the learning rate from 4e-4 to 2e-4 and the batch size from 1024 to 512 (see the sketch after the runs below).
AdamW 2-card: run 1
wandb: 🚀 View run at https://wandb.ai/yushunzhang0410/pretrain-tiny-llama-1.1b/runs/83b8yfjz
AdamW 2-card: run 2
wandb: 🚀 View run at https://wandb.ai/yushunzhang0410/pretrain-tiny-llama-1.1b/runs/8p6axrgw
The two runs are completely different, and the training fails.
- When switching the same setup to a single GPU, these issues do not occur: the two runs are mostly the same (with only slight differences), and the loss decreases stably without any spikes.
AdamW 1-card: run 1
wandb: 🚀 View run at https://wandb.ai/yushunzhang0410/pretrain-tiny-llama-1.1b/runs/kdg2qmj8
AdamW 1-card: run 2
wandb: 🚀 View run at https://wandb.ai/yushunzhang0410/pretrain-tiny-llama-1.1b/runs/vh23qd0u
The two runs are mostly the same, and the loss decreases stably.
Have you encountered a similar issue? Any idea why this happens?