
DeepSpeed training overflows

drunkinlove opened this issue 3 years ago • 4 comments

Hi! Thanks for replying to my earlier issues :)

I'm currently trying to finetune a model with deepspeed using scripts/deepspeed_gpt3_medium.sh as an example. After a while (usually 16k steps) training basically hangs with the following message repeated:

1622513061366 localhost info [2021-06-01 05:04:21,527] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 0.0, reducing to 0.0

Meaning the weight updates are too large and training failed to converge, right? I've also tried setting a lower LR (as in deepspeed_gpt3_xl_finetune.sh), but the behavior is the same.

Have you run into this problem at any point? I'd appreciate any advice.

drunkinlove avatar Jun 03 '21 13:06 drunkinlove

Try changing the DeepSpeed config. Also check that your batch size matches the DeepSpeed config. Loss-scale overflows usually happen at the start of training; if they appear near the end of training, it may be overfitting instead. Try evaluating older checkpoints on some test tasks.
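
For reference, the loss scaling and batch sizing are controlled by fields like the ones below in the DeepSpeed config (field names follow the DeepSpeed documentation; the values are only illustrative, not the ones shipped with scripts/deepspeed_gpt3_medium.sh):

# Sketch of the DeepSpeed config fields that control fp16 loss scaling and
# batch sizing. Values here are illustrative only.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # should match the per-GPU batch fed by pretrain_gpt3.py
    "gradient_accumulation_steps": 1,
    "fp16": {
        "enabled": True,
        "loss_scale": 0,            # 0 = dynamic loss scaling
        "initial_scale_power": 16,  # start the scale at 2**16
        "loss_scale_window": 1000,  # overflow-free steps before the scale is raised
        "hysteresis": 2,
        "min_loss_scale": 1,        # floor under the dynamic scale
    },
}

min_loss_scale puts a floor under the dynamic loss scale, which should keep it from collapsing toward the "Attempted loss scale: 0.0" state shown in the log above.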

king-menin avatar Jun 08 '21 14:06 king-menin

What perplexity does the model have at step 16k?

king-menin avatar Jun 08 '21 14:06 king-menin

Sometimes this issue arises during fp16 training. We recommend the following (a sketch of the corresponding config changes follows this list):

  1. Try decreasing the learning rate 2-4x and resume training from the last saved successful step.
  2. If that doesn't help, you could try resuming training in fp32 mode for a few thousand steps. It will be slower and may require decreasing the batch size to fit in memory. Hope this helps!
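
As a rough illustration of those two steps, assuming the learning rate and fp16 switch live in the DeepSpeed config (in the ru-gpts scripts they may instead be passed to pretrain_gpt3.py as --lr and --fp16, in which case the equivalent change is made there):

# Hypothetical sketch of the two recommendations as DeepSpeed-config edits.
import copy

def make_recovery_config(base_config: dict, lr_divisor: float = 4.0) -> dict:
    cfg = copy.deepcopy(base_config)
    # 1. Decrease the learning rate 2-4x before resuming from the last good checkpoint.
    if "optimizer" in cfg:
        cfg["optimizer"]["params"]["lr"] /= lr_divisor
    # 2. If overflows persist, fall back to fp32 for a few thousand steps
    #    (slower; the micro batch may also need to shrink to fit in memory).
    cfg["fp16"] = {"enabled": False}
    cfg["train_micro_batch_size_per_gpu"] = max(1, cfg.get("train_micro_batch_size_per_gpu", 1) // 2)
    return cfg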

ollmer avatar Jun 08 '21 16:06 ollmer

@king-menin Perplexity at step 16k is around 160. I found train_micro_batch_size_per_gpu=4 in the DeepSpeed config; is it supposed to equal the batch-size arg for pretrain_gpt3.py? Also, isn't overfitting supposed to lead to loss (and weight) stability, whereas my model seems to get huge updates that overflow in FP16?
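
On the batch-size part of the question: DeepSpeed ties the batch fields together as train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size, so the micro batch should line up with whatever per-GPU batch pretrain_gpt3.py actually feeds it; whether --batch-size is that per-GPU value is an assumption to verify against the script. A small, purely illustrative check:

# The documented DeepSpeed batch-size identity; the mapping to
# pretrain_gpt3.py's --batch-size is an assumption to verify.
def check_batch_config(micro_batch_per_gpu: int,
                       grad_accum_steps: int,
                       world_size: int,
                       train_batch_size: int) -> None:
    expected = micro_batch_per_gpu * grad_accum_steps * world_size
    assert train_batch_size == expected, (
        f"train_batch_size={train_batch_size} but "
        f"{micro_batch_per_gpu} * {grad_accum_steps} * {world_size} = {expected}"
    )

# Example: 4 per GPU, no accumulation, 8 GPUs -> train_batch_size must be 32.
check_batch_config(4, 1, 8, 32)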

@ollmer I'd like to try that, but when I load a deepspeed checkpoint I get the following error:

File "/home/user/miniconda3/envs/gpt_train/lib/python3.8/site-packages/torch/serialization.py", line 831, in load_tensor
    storage = zip_file.get_storage_from_record(name, size, dtype).storage()
OSError: [Errno 14] Bad address

Would you happen to know how to fix this?

drunkinlove avatar Jun 21 '21 14:06 drunkinlove