A few questions about the stage 3 README (RL section)
My questions are mostly about stage 3. According to the doc https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/README.md, it says:
> If you don't have step 1 and step 2 models, you may simply try
> `--actor_model_name_or_path facebook/opt-1.3b --critic_model_name_or_path facebook/opt-350m`
My question is: is an original model like facebook/opt-350m okay to use as the reward model? Based on my rough understanding, the reward model serves as the critic and is supposed to produce a ranking score, so I am confused whether an original opt-350m, without the stage 2 pairwise training, would be any good at reward assignment, or would even have the right output format for it.
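To make the output-format part of my confusion concrete, here is a minimal sketch of how I understand the reward model wrapper to work, assuming Hugging Face transformers; the class name `RewardModelSketch` and the forward details are my own illustration, not the repo's exact code:

```python
# Minimal sketch of my mental model, not DeepSpeed-Chat's actual RewardModel code.
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModelSketch(nn.Module):
    """Base LM plus a scalar head that maps hidden states to per-token scores."""

    def __init__(self, base_model_name: str):
        super().__init__()
        self.base = AutoModel.from_pretrained(base_model_name)
        # OPT projects its final hidden states down to word_embed_proj_dim,
        # so prefer that over hidden_size when the config defines it.
        out_dim = getattr(self.base.config, "word_embed_proj_dim",
                          self.base.config.hidden_size)
        # Freshly initialized head: until stage 2 trains it on pairwise
        # comparisons, its scalar outputs carry no ranking signal.
        self.v_head = nn.Linear(out_dim, 1, bias=False)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.base(input_ids,
                           attention_mask=attention_mask).last_hidden_state
        # One scalar per token; the last token's score is typically used
        # as the sequence-level reward.
        return self.v_head(hidden).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
reward_model = RewardModelSketch("facebook/opt-350m")
batch = tokenizer("Human: hi\n\nAssistant: hello!", return_tensors="pt")
scores = reward_model(batch["input_ids"], batch["attention_mask"])
print(scores[0, -1].item())  # essentially random before stage 2 training
```

If that picture is right, a raw opt-350m would give scores of the correct shape but with random values, which is exactly what makes me doubt the suggestion.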
Another question: the doc emphasizes that

> When you use the above script, please make sure you comment out the following such that it won't load the model weight from previous paths.
>
> `applications/DeepSpeed-Chat/training/utils/model/model_utils.py#L60`
However, I do not actually see how L60 has anything to do with this instruction.
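If it helps pin down what I should be looking for, I assume the README means a checkpoint-loading call along these lines; this is a hypothetical sketch, and the function name, path, and assertion are my guess, not the actual file contents:

```python
# Hypothetical sketch of the kind of loading code the README points at;
# not necessarily the actual contents of model_utils.py.
import os
import torch

def maybe_load_previous_checkpoint(model, model_name_or_path):
    # This is the sort of block that would have to be commented out when
    # passing raw Hugging Face models such as facebook/opt-350m, which
    # ship no step 1 / step 2 checkpoint file at this path.
    ckpt_path = os.path.join(model_name_or_path, "pytorch_model.bin")
    assert os.path.exists(ckpt_path), f"Cannot find model checkpoint at {ckpt_path}"
    model.load_state_dict(torch.load(ckpt_path, map_location="cpu"))
    return model
```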
Hi, this setup is only for test usage, i.e., you cannot use it to train a real model.
Closed as there was no follow-up.