DeepSpeedExamples The step2 scoring looks correct but the step3 model is talking gibberish

For the step2 scoring: `python3 training/step2_reward_model_finetuning/rw_eval.py --model_name_or_path output/reward-models/350m/ ==================Eval result============================ prompt: Human: Please tell me about Microsoft in a few sentence? Assistant:

good_ans: Microsoft is a software company that develops, licenses, and supports software products, including Windows, Office, and Windows Phone. It is the largest software company in the world by revenue, and is the second-largest software company in the world by market capitalization. Microsoft is also a major provider of cloud computing services, including the Microsoft Azure cloud computing platform and the Microsoft Office 365 suite of products. The company was founded in 1975

bad_ans: I'm not sure. Human: What's your job? Assistant: I'm not sure. Human: What's your favorite color? Assistant: I'm not sure. Human: What's your favorite food? Assistant: I'm not sure. Human: What's your favorite drink? Assistant: I'm not sure.

=============Scores (higher, better)======================== good_ans score: 2.6106441020965576 bad_ans score: 0.949201762676239 ==================Eval result============================ prompt: Human: Explain the moon landing to a 6 year old in a few sentences. Assistant:

good_ans: The moon landing was a major milestone in the history of human exploration of the solar system. It was the first time humans had ever set foot on another planet, and it was a major turning point in the history of human civilization. The astronauts, Neil Armstrong, Buzz Aldrin, and Michael Collins, successfully landed the Apollo 11 spacecraft on the moon, marking the first time humans had ever set foot on another

bad_ans: I don't know, I don't know.

=============Scores (higher, better)======================== good_ans score: 10.038764953613281 bad_ans score: 4.888004302978516 `

for the step3: python3 chat.py --path ./output/step3-models/13b/actor/ Enter input (type 'quit' to exit, 'clear' to clean memory): what is internet explorer? ------------------------------ Round 1 ------------------------------ Human: what is internet explorer? Assistant: mitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemite, yeah, yeah, yeah, yeah, yeah, yeah,,,,,,,,,,,,,,,,

I use the command python train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --num-gpus 8 to train the model on a HGXnode with 8*80GA100

Do I need to provide the training log, or do I need to change some hyperparameters？

Apr 14 '23 04:04 panganqi

For the step2 scoring: `python3 training/step2_reward_model_finetuning/rw_eval.py --model_name_or_path output/reward-models/350m/ ==================Eval result============================ prompt: Human: Please tell me about Microsoft in a few sentence? Assistant:

good_ans: Microsoft is a software company that develops, licenses, and supports software products, including Windows, Office, and Windows Phone. It is the largest software company in the world by revenue, and is the second-largest software company in the world by market capitalization. Microsoft is also a major provider of cloud computing services, including the Microsoft Azure cloud computing platform and the Microsoft Office 365 suite of products. The company was founded in 1975

bad_ans: I'm not sure. Human: What's your job? Assistant: I'm not sure. Human: What's your favorite color? Assistant: I'm not sure. Human: What's your favorite food? Assistant: I'm not sure. Human: What's your favorite drink? Assistant: I'm not sure.

=============Scores (higher, better)======================== good_ans score: 2.6106441020965576 bad_ans score: 0.949201762676239 ==================Eval result============================ prompt: Human: Explain the moon landing to a 6 year old in a few sentences. Assistant:

good_ans: The moon landing was a major milestone in the history of human exploration of the solar system. It was the first time humans had ever set foot on another planet, and it was a major turning point in the history of human civilization. The astronauts, Neil Armstrong, Buzz Aldrin, and Michael Collins, successfully landed the Apollo 11 spacecraft on the moon, marking the first time humans had ever set foot on another

bad_ans: I don't know, I don't know.

=============Scores (higher, better)======================== good_ans score: 10.038764953613281 bad_ans score: 4.888004302978516 `

for the step3: python3 chat.py --path ./output/step3-models/13b/actor/ Enter input (type 'quit' to exit, 'clear' to clean memory): what is internet explorer? ------------------------------ Round 1 ------------------------------ Human: what is internet explorer? Assistant: mitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemitemite, yeah, yeah, yeah, yeah, yeah, yeah,,,,,,,,,,,,,,,,

I use the command python train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --num-gpus 8 to train the model on a HGXnode with 8*80GA100

Do I need to provide the training log, or do I need to change some hyperparameters？

same problem

why step 3 loss<0?

Apr 14 '23 05:04 xulangping

Same problem. Is something in the training process went wrong?

Apr 16 '23 04:04 Trace2333

Same problem. any update?

Apr 17 '23 08:04 bestpredicts

截屏2023-04-18 10 35 01 same problem, There is an issue with RLHF.

Apr 18 '23 02:04 EdmundYanJ

We are investigating the issue.

Apr 18 '23 16:04 yaozhewei

Otherwise, is it necessary to load weights from SFT in step 2? In this project, the RW model only starts from the pre-trained model, which is different from instructGPT.

Apr 20 '23 03:04 JingerAI

@JingerAI we tried both, there is no big difference. Also, instructGPT paper also mentioned that they tried both and did not see big difference

Apr 20 '23 17:04 yaozhewei

@yaozhewei You are right, many thanks. Additionally, in this project, it appears that the KL penalty is added starting from the last token. Does this approach make sense in light of the statement made in the paper, "we add a per-token KL penalty from the SFT model at each token to mitigate overoptimization of the reward model"?

Apr 21 '23 09:04 JingerAI

@JingerAI please create another issue so we can separate different problems. @panganqi and others, please try the latest branch to see if the accuracy problem is resolved. Also, please take a look at https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/README.md for some learnings we had from our exploration.

Apr 24 '23 03:04 yaozhewei

@JingerAI please create another issue so we can separate different problems. @panganqi and others, please try the latest branch to see if the accuracy problem is resolved. Also, please take a look at https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/README.md for some learnings we had from our exploration.

All right~

Apr 24 '23 03:04 JingerAI

Closed as no followup

May 05 '23 18:05 yaozhewei

This issue still persists. RLHF produces gibberish when using the step 1 and 2 checkpoints. Are there any updates from your investigation @yaozhewei ?

May 16 '23 04:05 rohansood10

DeepSpeedExamples DeepSpeedExamples copied to clipboard

The step2 scoring looks correct but the step3 model is talking gibberish

DeepSpeedExamples
DeepSpeedExamples copied to clipboard