`PPOv2Trainer` and `RLOOTrainer`: Remove the implicit assumption that the reward model & policy model share the same tokenizer
Feature request
Remove the implicit assumption in PPOv2Trainer and RLOOTrainer that the reward model & policy model share the same tokenizer
Motivation
Currently, as I understand it, PPOv2Trainer and RLOOTrainer both assume that the reward model and the policy model share the same tokenizer. This is often not the case: for instance, I want to try different reward models from RewardBench, and these are frequently based on different language models with different tokenizers.
This implicit assumption should be removed.
For an example of where this behavior shows up, see: https://github.com/huggingface/trl/blob/main/trl/trainer/ppov2_trainer.py#L599-L601
Note that the raw token ids produced by the policy model are passed directly to the reward model. Those ids are not meaningful if the reward model does not use the same tokenizer.
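To illustrate the mismatch outside the trainer code, here is a minimal sketch; the two model names are arbitrary choices of models with different tokenizers, not anything the trainers require:

```python
from transformers import AutoTokenizer

# Hypothetical pairing: any two models with different tokenizers show the issue.
policy_tok = AutoTokenizer.from_pretrained("gpt2")
reward_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

ids = policy_tok("The quick brown fox")["input_ids"]

# The same ids mean different things under each vocabulary:
print(policy_tok.decode(ids))  # "The quick brown fox"
print(reward_tok.decode(ids))  # unrelated wordpieces, i.e. gibberish to the reward model
```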
Your contribution
If the developers agree, I'd be happy to discuss this change with them and how to best implement it.
@qgallouedec if you could please comment: once we have a consensus on (1) whether this is indeed a problem and (2) what a reasonable solution looks like, I can work on a PR.
@kashif can I ask you to please weigh in on this? I want to know whether you agree with this proposed change, and if so, what a solution might look like. I'd be happy to (help) implement it.
This assumption is made for every trainer that uses a reward model, so it also includes Online DPO and its variants XPO and Nash-MD. This would be a great improvement.
Right, it would be a good improvement, I believe...
If someone struggles with this, it can be worked around easily by overriding the get_reward() function:
First, decode the generated input_ids using the policy model's tokenizer, then re-tokenize the resulting strings using the reward model's tokenizer, and finally pass those into the reward model for scoring. A rough sketch is below.
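A minimal sketch of that workaround; the function name, argument names, and the sequence-classification reward head are assumptions for illustration, not the trainer's actual get_reward() signature:

```python
import torch

def get_reward_with_retokenization(
    reward_model, query_responses, policy_tokenizer, reward_tokenizer, device
):
    """Decode policy-generated token ids, re-tokenize with the reward model's
    tokenizer, and score the resulting texts."""
    # 1. Turn the policy model's token ids back into text.
    texts = policy_tokenizer.batch_decode(query_responses, skip_special_tokens=True)

    # 2. Re-tokenize the text with the reward model's own tokenizer.
    reward_inputs = reward_tokenizer(
        texts, padding=True, truncation=True, return_tensors="pt"
    ).to(device)

    # 3. Score with the reward model (assumed here to be a
    #    sequence-classification model returning one scalar logit per sequence).
    with torch.no_grad():
        scores = reward_model(**reward_inputs).logits.squeeze(-1)
    return scores
```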
(P.S. I genuinely think this should be handled internally by the PPOTrainer 🤨)
I also have a strong need for this.
Hey there,
I can fix this and submit a PR if you need it. 🔧
It's now fixed for RLOO
Is it fixed for all Trainers?
Online DPO and its variants (XPO, Nash-MD): yes
GRPO: yes
RLOO: yes
PPO: no, this needs refactoring
Ok nice!