
Clarification on reward/value heads in PPOV2

Open SalmanMohammadi opened this issue 7 months ago • 3 comments

First, thank you for your efforts in helping to bring accurate and performant RLHF techniques to the open-source community. I'm raising this issue in the hope of getting some clarification on a couple of implementation details in PPOV2:

--- 1 --- The default AutoModelForSequenceClassification implementation in Transformers uses bias=False for the classification nn.Linear. In a recent fork for training reward models, and in line with the suggestion in The N Implementation Details, the bias is correctly initialised prior to reward-model training.
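As a concrete illustration, I have something along these lines in mind (a minimal sketch of the recommended head initialisation; the model name and the score attribute are assumptions based on the Pythia-based RMs, not TRL's exact code):

# Sketch: give the classification head a zero-initialised bias before RM training
import torch.nn as nn
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "EleutherAI/pythia-1b-deduped", num_labels=1
)
hidden_size = model.config.hidden_size
model.score = nn.Linear(hidden_size, 1, bias=True)  # default GPTNeoX head uses bias=False
nn.init.normal_(model.score.weight, std=1 / (hidden_size + 1) ** 0.5)  # per The N Implementation Details
nn.init.zeros_(model.score.bias)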

However, when I run the snippet from examples/scripts/ppo/ppo.py for an exemplar RM:

# Load model directly
from transformers import AutoModelForSequenceClassification

reward_model = AutoModelForSequenceClassification.from_pretrained("trl-internal-testing/rm_descriptiveness_1b")

""" output:
config.json: 100%
 869/869 [00:00<00:00, 12.0kB/s]
model.safetensors: 100%
 3.64G/3.64G [02:49<00:00, 25.7MB/s]

Some weights of the model checkpoint at trl-internal-testing/rm_descriptiveness_1b were not used when initializing GPTNeoXForSequenceClassification: ['score.bias']
- This IS expected if you are initializing GPTNeoXForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPTNeoXForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
"""
sd = reward_model.state_dict()
"score.bias" in sd
"""
False
"""

Is this expected behaviour - to not use the bias during PPO training?

--- 2 ---

In the previous PPO implementation, the value head is simply another head that shares the base model's backbone. In PPOV2, however, the value model appears to be instantiated as a separate model. Is my understanding correct here? If so, I'm curious about the reasoning behind this, since a separate value model requires additional memory on the order of another reward-model-sized network. Do you see an improvement in algorithm performance here?
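To make sure I'm comparing the right things, here is how I understand the two setups (a sketch; the PPOV2 side is based on examples/scripts/ppo/ppo.py, and the model names are just placeholders):

# (a) previous PPOTrainer: one backbone with a scalar value head attached
from trl import AutoModelForCausalLMWithValueHead
policy_and_value = AutoModelForCausalLMWithValueHead.from_pretrained("EleutherAI/pythia-1b-deduped")

# (b) PPOV2: a separate, full-size value model alongside the policy
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification
policy = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b-deduped")
value_model = AutoModelForSequenceClassification.from_pretrained("EleutherAI/pythia-1b-deduped", num_labels=1)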

Many thanks!

P.S. For context, I've been working in parallel on a PPO implementation in Torchtune (https://github.com/pytorch/torchtune/pull/1005/), and I've found all the empirical work and implementation details invaluable so far.

SalmanMohammadi · Jun 27 '24 15:06