DeepSpeedExamples
Step 3 failed for customized GPT
Dear all,
When we use our customized GPT model for step 3 training, we get a kernel execution error (m: 5120, n: 8, k: 1706, error: 14). However, once we turn off enable-hybrid-engine, training works fine. The kernel error originates in the generate_experience function.
Can anyone help?
All right, in addition to this: after we turned off enable-hybrid-engine, we ran into another error, "Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run." The script we used for training is:
# DeepSpeed Team
ACTOR_MODEL_PATH=$1
CRITIC_MODEL_PATH=$2
ACTOR_ZERO_STAGE=$3
CRITIC_ZERO_STAGE=$4
OUTPUT=$5
if [ "$OUTPUT" == "" ]; then
OUTPUT=/data/.
fi
if [ "$ACTOR_ZERO_STAGE" == "" ]; then
ACTOR_ZERO_STAGE=3
fi
if [ "$CRITIC_ZERO_STAGE" == "" ]; then
CRITIC_ZERO_STAGE=3
fi
mkdir -p $OUTPUT
Num_Padding_at_Beginning=1 # this is model related
Actor_Lr=9.65e-6
Critic_Lr=5e-6
deepspeed --master_port 12346 main.py \
--data_path CutomizedDataset \
--data_split 2,4,4 \
--actor_model_name_or_path /path/to/file \
--critic_model_name_or_path /path/to/file \
--num_padding_at_beginning 1 \
--per_device_train_batch_size 4 \
--per_device_mini_train_batch_size 4 \
--generation_batch_numbers 1 \
--ppo_epochs 1 \
--max_answer_seq_len 256 \
--max_prompt_seq_len 256 \
--actor_learning_rate ${Actor_Lr} \
--critic_learning_rate ${Critic_Lr} \
--actor_weight_decay 0.1 \
--critic_weight_decay 0.1 \
--num_train_epochs 10 \
--lr_scheduler_type cosine \
--gradient_accumulation_steps 2 \
--num_warmup_steps 100 \
--deepspeed --seed 1234 \
--actor_gradient_checkpointing \
--inference_tp_size 1 \
--actor_zero_stage $ACTOR_ZERO_STAGE \
--critic_zero_stage $CRITIC_ZERO_STAGE \
--output_dir $OUTPUT \
&> $OUTPUT/training.log
Hi @ruihan0495, I am hitting the same problem in stage 2. I am using bloomz and my own dataset. Have you figured out how to solve it?
Dear @LuciusMos, we found that our problem came from the tokenizer encoding step. Say our tokenizer's vocabulary size is 6666, but it sometimes encodes an input string to ids such as 6668, which exceed the vocab size. We solved this by mapping the out-of-range input ids to [UNK]. I hope this helps :)
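For reference, here is a minimal sketch of the kind of remapping described above, assuming a Hugging Face tokenizer that defines an [UNK] token; clamp_to_vocab is a hypothetical helper name, not part of DeepSpeed-Chat:

import torch

def clamp_to_vocab(input_ids, tokenizer):
    # Replace any token id outside the tokenizer's vocabulary with the [UNK] id,
    # so embedding lookups and fused kernels never see out-of-range ids.
    # Assumes tokenizer.unk_token_id is not None.
    vocab_size = len(tokenizer)
    unk_id = tokenizer.unk_token_id
    unk_fill = torch.full_like(input_ids, unk_id)
    return torch.where(input_ids < vocab_size, input_ids, unk_fill)

# Example usage when tokenizing a prompt:
# batch = tokenizer(prompt, return_tensors="pt", max_length=max_seq_len,
#                   padding="max_length", truncation=True)
# batch["input_ids"] = clamp_to_vocab(batch["input_ids"], tokenizer)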
Is the change made in data_utils.py, under train_phase=3? For example:
tokenizer(prompt, return_tensors="pt", max_length=max_seq_len, padding="max_length", truncation=True)
In my training log, some act_loss and cri_loss values are inf or extremely large, and then I get the "Current loss scale already at minimum - cannot decrease scale anymore. Exiting run." error:
epoch: 0|step: 46|ppo_ep: 1|act_loss: inf|cri_loss: inf|unsuper_loss: 0.0
Could you advise how to solve this?
You can do the remapping in main.py, where the batches are assembled. An inf loss may indicate a bug; if you have ruled that out, try tuning the learning rate and batch size, and also the training precision settings in ds_config.
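To make the last suggestion concrete, here is a hedged sketch of the FP16 section of a DeepSpeed config; the keys are standard DeepSpeed options, but the values are only illustrative, not a recommendation:

ds_config = {
    "fp16": {
        "enabled": True,
        # initial_scale_power sets the starting dynamic loss scale (2**power);
        # loss_scale_window and hysteresis control how quickly the scale is
        # raised or lowered after overflow-free or overflowing steps.
        "initial_scale_power": 16,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        # Training aborts with "Current loss scale already at minimum" once
        # repeated overflows push the scale down to min_loss_scale, so a
        # persistent inf loss usually points to a data or model issue rather
        # than to these settings alone.
        "min_loss_scale": 1,
    }
}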
Is there a solution to this yet? I am running into the same problem.
I have not solved it yet. However, I later ran the official demo end to end without changing the data or the model, using the single_node launch, and no error occurred; only at the end of step 2 did it report "Epoch 1/1 with loss inf".
My current impression matches what @ruihan0495 said: it is a problem in the PPO training stage, it depends on batch size, data, learning rate and similar parameters, and it is very unstable.
I trained with my custom dataset; steps 1 and 2 are fine, but I hit a similar problem in step 3. What did you actually do to resolve it, @ruihan0495? Any suggested code changes would be appreciated.
We tried BF16 instead of FP16 during training. Although the loss scale error goes away, training becomes very unstable, so the best option is still to stick with FP16. Unfortunately, the loss scale error is indeed a side effect of FP16 training for large models.
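For completeness, switching to BF16 in DeepSpeed amounts to replacing the fp16 block with a bf16 block in the config; this is a sketch under the assumption that the config is built as a Python dict, as above:

ds_config = {
    # BF16 has the same exponent range as FP32, so no dynamic loss scaling is
    # involved and the "loss scale at minimum" error cannot occur; the trade-off
    # is lower mantissa precision, which matches the instability described above.
    "bf16": {"enabled": True},
    "fp16": {"enabled": False},
}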
Based on what you said, does that mean the actor and the critic must use the same tokenizer? I used llama as the actor and opt-350m as the critic and hit this problem; if I switch the actor to opt-1.3b, everything works fine.