
The actor constantly generates ['</s>'] or ['<|endoftext|></s>'] after 200 steps in RLHF with hybrid engine disabled

[Open] mousewu opened this issue 10 months ago · 1 comment

Settings:

- actor & critic: OPT-1.3b
- reward model: OPT-350m
- GPU: 4 × V100 32G

Running script:

```shell
ACTOR_MODEL_PATH=$1
CRITIC_MODEL_PATH=$2
ACTOR_ZERO_STAGE=$3
CRITIC_ZERO_STAGE=$4
OUTPUT=$5
if [ "$OUTPUT" == "" ]; then
    OUTPUT=./output
fi
if [ "$ACTOR_ZERO_STAGE" == "" ]; then
    ACTOR_ZERO_STAGE=3
fi
if [ "$CRITIC_ZERO_STAGE" == "" ]; then
    CRITIC_ZERO_STAGE=3
fi

if [ "$ACTOR_MODEL_PATH" == "" ]; then
    ACTOR_MODEL_PATH=AdamG012/chat-opt-1.3b-sft-deepspeed
fi
if [ "$CRITIC_MODEL_PATH" == "" ]; then
    CRITIC_MODEL_PATH=AdamG012/chat-opt-350m-reward-deepspeed
fi

echo "Step3: ACTOR_MODEL_PATH=$ACTOR_MODEL_PATH CRITIC_MODEL_PATH=$CRITIC_MODEL_PATH ACTOR_ZERO_STAGE=$ACTOR_ZERO_STAGE CRITIC_ZERO_STAGE=$CRITIC_ZERO_STAGE OUTPUT=$OUTPUT"

mkdir -p $OUTPUT

Num_Padding_at_Beginning=1 # this is model related
Actor_Lr=9.65e-6
Critic_Lr=5e-6

deepspeed --master_port 12346 main.py \
   --data_path Dahoas/rm-static \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 1 \
   --per_device_generation_batch_size 1 \
   --per_device_training_batch_size 1 \
   --generation_batches 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 256 \
   --max_prompt_seq_len 256 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --num_train_epochs 2 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --actor_dropout 0.0 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --enable_ema \
   --output_dir $OUTPUT \
   --enable_tensorboard \
   --tensorboard_path $OUTPUT \
   &> $OUTPUT/training.log
```
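Before looking at the log, one way to quantify the collapse described in the title is to count how many sampled answers consist of nothing but EOS/padding tokens. This is a minimal diagnostic sketch, not part of DeepSpeed-Chat; the `degenerate_fraction` helper is made up, and the token ids assume an OPT tokenizer (`eos_token_id=2`, `pad_token_id=1`).

```python
def degenerate_fraction(answers, eos_token_id=2, pad_token_id=1):
    """Return the fraction of answers whose tokens are all EOS/pad.

    `answers` is a list of token-id lists, one per sampled answer.
    Token ids assume an OPT tokenizer (</s> = 2, <pad> = 1).
    """
    junk = {eos_token_id, pad_token_id}
    flagged = sum(1 for ans in answers if ans and all(t in junk for t in ans))
    return flagged / len(answers) if answers else 0.0


# Example: two collapsed answers out of three
batch = [[2], [2, 1, 1], [464, 2159, 2]]
print(degenerate_fraction(batch))  # → 0.666...
```

Logging this fraction per step would show whether the degeneration is gradual or sudden around step 200.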

The log is below:

```
--- prompt --> step=272, rank=1, ['\n\nHuman: How can I train for running a marathon?\n\nAssistant: Gosh! I guess I could give you lots of very detailed advice, but I’m not sure that’s the best idea. That’s a pretty rigorous training program! If you want to check in with me every few weeks, I could share some of the ideas that might be helpful in your training. Do you have a pace in mind that you’re trying to get to?\n\nHuman: No, I just want to be prepared and get in better shape. The marathon is about 4 months from now.\n\nAssistant: Maybe focus on just running more regularly for now? If you just get in the habit of running, you’ll start feeling stronger and faster, and once you get used to it, the distance of a marathon will feel relatively easy.\n\nHuman: That makes sense. I will improve my conditioning if I just make it a habit to run every day.\n\nAssistant:']
--- prompt --> step=272, rank=0, ['\n\nHuman: WHat can I use witchhazel for?\n\nAssistant: It’s a multipurpose natural remedy that’s also a common household product. It’s used to soothe sore muscles and joints, as well as a facial wash, mouthwash, and hair rinse. Some people use it topically for inflammation and itching. And it’s also used in many natural cleaning products and body care products.\n\nHuman: How do you use it for sore muscles?\n\nAssistant:']
--- prompt --> step=272, rank=3, ['\n\nHuman: How can I stop my vomiting bout after food poisoning?\n\nAssistant: I’m sorry you’ve been feeling sick. Is there anything you think you can do to keep vomiting? Would eating small frequent meals work?\n\nHuman: Oh, that would probably work. What should I be eating?\n\nAssistant: Any food is a good choice, to keep you from getting too hungry or dehydrated. You could try eating mostly salty foods like broth, juice, soda, and bread.\n\nHuman: Are you sure soda is a good idea?\n\nAssistant:']
--- prompt --> step=272, rank=2, ['\n\nHuman: Please tell me how to make brownies.\n\nAssistant:']
--- ans --> step=272, rank=1, [' I<|endoftext|>']
--- ans --> step=272, rank=3, ['<|endoftext|>']
--- ans --> step=272, rank=0, ['<|endoftext|>']
--- ans --> step=272, rank=2, ['<|endoftext|>']
Epoch: 0 | Step: 272 | PPO Epoch: 1 | Actor Loss: -2.625 | Critic Loss: 3.69140625 | Unsupervised Loss: 0.0
End-to-End => Latency: 3.25s, TFLOPs: 2.03, Samples/sec: 1.23, Time/seq 0.81s, Batch Size: 4, Total Seq. Length: 512
Generation => Latency: 1.97s, Per-token Latency 7.71 ms, TFLOPs: 0.69, BW: 341.33 GB/sec, Answer Seq. Length: 256
Training => Latency: 1.27s, TFLOPs: 4.10
Actor Model Parameters => 1.316 B, Critic Model Parameters => 0.331 B
Average reward score: -11.5546875 | EMA reward score: -11.419149377686367
```
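For reference, the "EMA reward score" line is an exponentially smoothed average of the per-step reward, which is why it lags the raw "Average reward score" as the reward drops. A minimal sketch of such an update, assuming a smoothing factor of 0.99 (an illustrative value, not necessarily what DeepSpeed-Chat uses):

```python
def ema_update(ema, reward, beta=0.99):
    """One exponential-moving-average step: keep most of the old
    estimate and mix in a small fraction of the new reward."""
    return beta * ema + (1.0 - beta) * reward


# Mirroring the log line above: the smoothed score moves only
# slightly toward the latest (lower) average reward.
new_ema = ema_update(-11.419149377686367, -11.5546875)
print(new_ema)
```

Because the EMA tracks the raw reward with a delay, a widening gap between the two is itself a sign that the reward is falling quickly, as it does once the actor collapses to EOS-only answers.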

mousewu, Apr 09 '24 07:04

Hi, have you solved this?

ouyanmei, Aug 20 '24 07:08