
Step 3 failed for customized GPT

Open ruihan0495 opened this issue 1 year ago

Dear all,

When we use our customized GPT model for step 3 training, we get a kernel execution error (m: 5120, n: 8, k: 1706, error: 14); however, once we turn off enable-hybrid-engine, training works fine. The kernel error is raised inside the generate_experience function. Can anyone help?

ruihan0495 avatar Apr 17 '23 13:04 ruihan0495

All right, in addition to this: after we turned off enable-hybrid-engine, we encountered another error, "Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run." The script we used for training is:

# DeepSpeed Team
ACTOR_MODEL_PATH=$1
CRITIC_MODEL_PATH=$2
ACTOR_ZERO_STAGE=$3
CRITIC_ZERO_STAGE=$4
OUTPUT=$5
if [ "$OUTPUT" == "" ]; then
    OUTPUT=/data/.
fi
if [ "$ACTOR_ZERO_STAGE" == "" ]; then
    ACTOR_ZERO_STAGE=3
fi
if [ "$CRITIC_ZERO_STAGE" == "" ]; then
    CRITIC_ZERO_STAGE=3
fi
mkdir -p "$OUTPUT"

Num_Padding_at_Beginning=1 # this is model related

Actor_Lr=9.65e-6
Critic_Lr=5e-6

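# launch step-3 (RLHF/PPO) training; model paths, ZeRO stages and output dir come from the variables above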
deepspeed --master_port 12346 main.py \
   --data_path CutomizedDataset \
   --data_split 2,4,4 \
   --actor_model_name_or_path /path/to/file \
   --critic_model_name_or_path /path/to/file \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 4 \
   --per_device_mini_train_batch_size 4 \
   --generation_batch_numbers 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 256 \
   --max_prompt_seq_len 256 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --actor_weight_decay 0.1 \
   --critic_weight_decay 0.1 \
   --num_train_epochs 10 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 2 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --actor_gradient_checkpointing \
   --inference_tp_size 1 \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --output_dir $OUTPUT \
   &> $OUTPUT/training.log

ruihan0495 avatar Apr 18 '23 06:04 ruihan0495

Hi @ruihan0495, I am hitting the same problem in stage 2. I am using bloomz and my own dataset. Have you figured out how to solve this?

LuciusMos avatar Apr 24 '23 07:04 LuciusMos

Dear @LuciusMos, we found that our problem was caused by the tokenizer encoding process. Say the tokenizer's vocabulary size is 6666, but it sometimes encodes an input string to an id like 6668, which exceeds the vocab size. We solved this by mapping the out-of-range input ids to [UNK]. I hope this helps :)
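For reference, a minimal sketch of that fix, assuming a Hugging Face-style tokenizer and PyTorch tensors; the helper name and the place it is applied are illustrative, not code from the repo:

import torch

def clamp_out_of_vocab_ids(input_ids, tokenizer):
    # hypothetical helper: replace any token id >= vocab size with the [UNK] id
    vocab_size = len(tokenizer)
    unk_id = tokenizer.unk_token_id
    out_of_range = input_ids >= vocab_size
    return torch.where(out_of_range, torch.full_like(input_ids, unk_id), input_ids)

# applied right after encoding, before the batch reaches the model:
# tokens = tokenizer(prompt, return_tensors="pt", max_length=max_seq_len,
#                    padding="max_length", truncation=True)
# tokens["input_ids"] = clamp_out_of_vocab_ids(tokens["input_ids"], tokenizer)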

ruihan0495 avatar Apr 24 '23 08:04 ruihan0495

Is the change made in data_utils.py, under train_phase=3? For example, in tokenizer(prompt, return_tensors="pt", max_length=max_seq_len, padding="max_length", truncation=True)? In my training log, some act_loss and cri_loss values are inf or extremely large, and then I get the error "Current loss scale already at minimum - cannot decrease scale anymore. Exiting run." For example: epoch: 0|step: 46|ppo_ep: 1|act_loss: inf|cri_loss: inf|unsuper_loss: 0.0

Could you advise how to solve this?

BaiStone2017 avatar Apr 25 '23 03:04 BaiStone2017

You can do it in main.py when the batch is assembled. An inf loss may indicate a bug; if you have ruled that out, try tuning the learning rate and batch size, and also the training precision settings in ds_config.
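For context, the precision knobs referred to here are the fp16 loss-scaling fields of the DeepSpeed config. A sketch of that section, assuming the config is built as a Python dict as in the DeepSpeed-Chat utilities (the values are illustrative, not recommendations):

# illustrative fp16 section of the DeepSpeed config
ds_config["fp16"] = {
    "enabled": True,
    "loss_scale": 0,            # 0 selects dynamic loss scaling
    "initial_scale_power": 16,  # start the dynamic scale at 2**16
    "loss_scale_window": 1000,  # overflow-free steps required before the scale is raised again
    "min_loss_scale": 1,        # hitting this floor while overflows continue raises the error above
}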

ruihan0495 avatar Apr 25 '23 09:04 ruihan0495

(Quoting the comment above about act_loss/cri_loss becoming inf and the loss scale hitting its minimum.)

Has anyone found a solution to this? I am running into the same problem.

enbacoo avatar Apr 26 '23 06:04 enbacoo

This has not been solved yet, but I later ran the official demo without modifying the data or model, using the single_node scripts, and it completed without errors; only at the end of step 2 did it report "Epoch 1/1 with loss inf". My current impression matches what @ruihan0495 said: the problem lies in the PPO training itself, it is related to batch size, data, learning rate and other hyperparameters, and it is quite unstable.

BaiStone2017 avatar Apr 27 '23 00:04 BaiStone2017

I use my custom dataset to train; steps 1 and 2 are fine, but I face a similar problem in step 3. What did you actually do to resolve the problem, @ruihan0495? Any code change suggestions would be appreciated.

mayuanyang avatar Apr 29 '23 13:04 mayuanyang

We tried BF16 instead of FP16 during training; although the loss scale error goes away, training becomes very unstable, so the best option is still to stick with FP16. Unfortunately, the loss scale error does seem to be a side effect of FP16 training for large models.
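For reference, the switch described above amounts to replacing the fp16 block of the DeepSpeed config with a bf16 block; a minimal sketch, again assuming a Python config dict:

# illustrative: disable fp16 (and its dynamic loss scaling) and enable bf16 instead
ds_config["fp16"] = {"enabled": False}
ds_config["bf16"] = {"enabled": True}  # bf16 uses no loss scale, so the scale error cannot occur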

ruihan0495 avatar May 04 '23 01:05 ruihan0495

Going by your explanation, does that mean the Actor and Critic must use the same tokenizer? I used llama as the Actor and opt-350m as the Critic and ran into this problem; if I switch the Actor to opt-1.3b, everything works fine.

devinzhang91 avatar Aug 21 '23 07:08 devinzhang91