[Chatllama] Training Reward Model on Human Preference Data
Hi! Is there a specific reason that we train the reward model on absolute scores rather than on pairwise human preferences over the same prompts, as is done in most other RLHF work?
If you look at the OpenAI papers, both methods should work: https://arxiv.org/pdf/2009.01325.pdf https://arxiv.org/pdf/2203.02155.pdf However, the latter appears to be the one used to train the InstructGPT models (i.e. Davinci 3 and possibly ChatGPT), so we took training with a trained reward model as our main reference.
Are you more interested in the pairwise preference approach?
Even for InstructGPT, to my understanding, the labelers rank K different outputs for each prompt, which yields C(K,2) comparison pairs, and the RM is trained on those pairs; see the RM loss function in the paper (formula 1 on page 8). I think it would be extremely hard for different labelers to stay calibrated on absolute scores at labeling time.
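For reference, my recollection of that loss (worth double-checking against the paper) is:

$$
\mathrm{loss}(\theta) = -\frac{1}{\binom{K}{2}}\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim D}\Big[\log\big(\sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big)\Big]
$$

where $r_\theta(x, y)$ is the scalar reward for prompt $x$ and completion $y$, and $y_w$ is the completion the labeler preferred over $y_l$.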
Hi @TonyZhanghm, thanks for your post. This is a very interesting topic, and I tend to agree with you that training on comparisons rather than absolute values is easier.

From the OpenAI paper you can read that their reward model outputs a scalar value that is used as a score: "We use the output of the RM as a scalar reward. We tune the supervised policy to optimize this reward using the PPO algorithm." Following this direction, we attempted to train a model that does essentially the same thing, using DAVINCI to assign scores (for now).

The difficulty with training such a model on comparisons is the need for a suitable dataset and for a model with a sequence length large enough to process the two different 'conversations' being compared. Since our goal is simplicity and ease of use, we decided to start with a small model trained on {example, score} pairs. Obviously, OpenAI has a lot of labels and data on which to fine-tune this scalar reward, something we can't match in such a short time. More sophisticated approaches might work better, but they need to be feasible without a lot of computation or special hand-crafted data.

If you have any suggestions on how to set up the training of a reward model, they would be very helpful and appreciated. We always look forward to productive conversations and to contributors who are as excited as we are about the value we want to bring to the open-source community.
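In code, the "absolute score" objective we are using is essentially a regression on the assigned scores, along these lines (a minimal sketch with placeholder names, not the actual chatllama implementation):

```python
import torch.nn.functional as F

def absolute_score_loss(reward_model, example_batch, target_scores):
    """Regression on absolute scores (sketch of the current approach).

    `reward_model` is assumed to map a batch of conversations to a 1-D
    tensor of scalar scores; `target_scores` is a tensor of the scores
    assigned by DAVINCI (or a human labeler).
    """
    predicted = reward_model(example_batch)      # shape: (batch,)
    return F.mse_loss(predicted, target_scores)  # push toward the absolute score
```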
@PierpaoloSorbellini Here's my understanding of training the RM. You were right that the RM is supposed to output a scalar value, which is needed for PPO. Let's call the RM f(x). Also, we have x_win and x_lose for a comparison pair with the same prompt. The objective for training the RM would be maximizing f(x_win) - f(x_lose) instead of optimizing f(x) toward a fixed value. This doesn't require the model to have a longer context length at all; the only difference is the loss function.
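To make that concrete, here is a minimal sketch of what the pairwise objective could look like (names are placeholders, not taken from any particular repo):

```python
import torch.nn.functional as F

def pairwise_rm_loss(reward_model, chosen_batch, rejected_batch):
    """Pairwise reward-model loss: -log(sigmoid(f(x_win) - f(x_lose))).

    `reward_model` is assumed to map a batch of (prompt + completion)
    inputs to a 1-D tensor of scalar scores. Each forward pass still sees
    a single conversation, so no extra context length is needed.
    """
    score_win = reward_model(chosen_batch)     # f(x_win), shape: (batch,)
    score_lose = reward_model(rejected_batch)  # f(x_lose), shape: (batch,)
    # Maximizing f(x_win) - f(x_lose) is done by minimizing
    # -log(sigmoid(f(x_win) - f(x_lose))), as in the InstructGPT RM loss.
    return -F.logsigmoid(score_win - score_lose).mean()
```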
The SHP and Anthropic datasets mentioned in the README both contain pairwise comparison annotations, and I also think models like ChatGPT can give you "LLM feedback" in the form of comparisons just as easily as in the form of scalar scores.
Let me know what you think!
Hi, I have the same concern as @TonyZhanghm; it seems the InstructGPT paper and similar work use the pairwise loss when training the reward model. I just found this reward model training code that might be useful as a reference: https://github.com/LAION-AI/Open-Assistant/blob/91b6ff24a9216ee7341418c7ccd6e5ce45e40328/model/reward/instructor/trainer.py