
`PPOv2Trainer` and `RLOOTrainer`: Remove the implicit assumption that the reward model & policy model share the same tokenizer

Open RylanSchaeffer opened this issue 1 year ago • 4 comments

Feature request

Remove the implicit assumption in PPOv2Trainer and RLOOTrainer that the reward model & policy model share the same tokenizer

Motivation

Currently, as I understand it, PPOv2Trainer and RLOOTrainer both assume that the reward model and the policy model share the same tokenizer. This is often not the case; for instance, I want to try different reward models from RewardBench, and these are frequently built on different base language models with different tokenizers.

This implicit assumption should be removed.

To see where this behavior arises, see this example: https://github.com/huggingface/trl/blob/main/trl/trainer/ppov2_trainer.py#L599-L601

Note that the raw tokens of the policy model are passed directly to the reward model. These tokens are not meaningful if the reward model does not share the same tokenizer.
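A minimal illustration of why this matters (the model names here are just examples, not what TRL uses): the same text maps to entirely different token ids under two different tokenizers, so ids produced by the policy tokenizer carry no meaning for a reward model trained with another tokenizer.

```python
from transformers import AutoTokenizer

# Example tokenizers only -- any two models with different vocabularies show the issue.
policy_tok = AutoTokenizer.from_pretrained("gpt2")
reward_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "The quick brown fox jumps over the lazy dog."
policy_ids = policy_tok(text)["input_ids"]
reward_ids = reward_tok(text)["input_ids"]

# The two id sequences come from different vocabularies; feeding policy_ids
# straight into the reward model would score effectively random tokens.
print(policy_ids)
print(reward_ids)
```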

Your contribution

If the developers agree, I'd be happy to discuss this change with them and how to best implement it.

RylanSchaeffer avatar Aug 26 '24 19:08 RylanSchaeffer

@qgallouedec if you could please comment, once we have a consensus of (1) whether this is indeed a problem and (2) what a reasonable solution looks like, I can work on a PR

RylanSchaeffer avatar Aug 28 '24 13:08 RylanSchaeffer

@kashif can I ask you to please weigh in on this? I want to know whether you agree with this proposed change, and if so, what a solution might look like. I'd be happy to (help) implement it.

RylanSchaeffer avatar Sep 08 '24 18:09 RylanSchaeffer

This assumption is made for every trainer that uses a reward model, so it also includes Online DPO and its variants XPO and Nash-MD. This would be a great improvement.

qgallouedec avatar Oct 20 '24 17:10 qgallouedec

Right, would be a good improvement, I believe...

kashif avatar Oct 20 '24 17:10 kashif

If someone is struggling with this, it can be fixed easily. All you have to do is override the get_reward() function:

First, decode the generated input_ids using the tokenizer of the policy model, then re-tokenize the resulting string using the tokenizer of the reward model, and finally pass that into the reward model for scoring.

(P.S. I genuinely think this should be handled internally by the PPOTrainer 🤨)
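A rough sketch of that decode/re-tokenize workaround is below. The function name and arguments are placeholders, and the exact integration point (e.g. wrapping or replacing TRL's get_reward helper) depends on your TRL version; it also assumes the reward model is a standard sequence-classification head.

```python
import torch

def rescore_with_reward_tokenizer(query_response_ids, policy_tokenizer,
                                  reward_tokenizer, reward_model):
    # 1) Decode the generated ids back to text with the *policy* tokenizer.
    texts = policy_tokenizer.batch_decode(query_response_ids, skip_special_tokens=True)

    # 2) Re-tokenize the strings with the *reward model's* tokenizer.
    reward_inputs = reward_tokenizer(
        texts, return_tensors="pt", padding=True, truncation=True
    ).to(reward_model.device)

    # 3) Score with the reward model (assumed AutoModelForSequenceClassification-style).
    with torch.no_grad():
        scores = reward_model(**reward_inputs).logits.squeeze(-1)
    return scores
```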

TheTahaaa avatar Apr 12 '25 21:04 TheTahaaa

I also have a strong demand for this

RekkimiARG avatar Apr 25 '25 11:04 RekkimiARG

Hey there,

I can fix this and submit a PR if you guys need it. 🧐

TheTahaaa avatar Aug 29 '25 21:08 TheTahaaa

It's now fixed for RLOO

qgallouedec avatar Aug 29 '25 21:08 qgallouedec

Is it fixed for all Trainers?

RylanSchaeffer avatar Aug 29 '25 21:08 RylanSchaeffer

Online DPO and variants (XPO, Nash-MD): yes
GRPO: yes
RLOO: yes
PPO: no, this needs refactoring

qgallouedec avatar Aug 29 '25 23:08 qgallouedec

Ok nice!

RylanSchaeffer avatar Aug 29 '25 23:08 RylanSchaeffer