Leandro von Werra
The reward model has to be trained before the RL/PPO loop. That's why it's not part of the trainer. However, there is a script to train a reward model in...
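For context, here is a minimal sketch of the pairwise loss such a reward-model training script optimizes before the PPO loop ever starts; the base model, example texts, and the `pairwise_reward_loss` helper are illustrative assumptions, not the exact script:

```python
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
reward_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
reward_model.config.pad_token_id = tokenizer.pad_token_id

def pairwise_reward_loss(chosen_texts, rejected_texts):
    """Score preferred and rejected completions; push chosen scores above rejected ones."""
    chosen = tokenizer(chosen_texts, return_tensors="pt", padding=True, truncation=True)
    rejected = tokenizer(rejected_texts, return_tensors="pt", padding=True, truncation=True)
    r_chosen = reward_model(**chosen).logits.squeeze(-1)      # (batch,)
    r_rejected = reward_model(**rejected).logits.squeeze(-1)  # (batch,)
    # -log sigmoid(r_chosen - r_rejected) is minimized when chosen outranks rejected
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy preference pair (hypothetical data, just to show the call)
loss = pairwise_reward_loss(
    ["Question: 2+2? Answer: 4"],
    ["Question: 2+2? Answer: banana"],
)
loss.backward()  # an optimizer step on the reward model's parameters would follow
```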
The reward model is not part of the PPO definition. PPO assumes an environment that emits a reward based on actions. In RLHF we simulate the human preference with a...
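To illustrate, here is a rough sketch of how the trained reward model stands in for the environment reward inside the loop, using the older TRL `PPOTrainer` API; `ppo_trainer`, `reward_model`, `tokenizer`, and `dataloader` are assumed to be set up as in the TRL examples:

```python
import torch

generation_kwargs = {"max_new_tokens": 32, "do_sample": True, "pad_token_id": tokenizer.eos_token_id}

for batch in dataloader:
    query_tensors = batch["input_ids"]  # list of 1-D token tensors, one per prompt

    # 1. the policy "acts": sample a response for each query
    response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, **generation_kwargs)
    texts = [q + r for q, r in zip(batch["query"], tokenizer.batch_decode(response_tensors))]

    # 2. the "environment" responds: the reward model scores each (query, response) pair
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        scores = reward_model(**inputs).logits.squeeze(-1)
    rewards = [score for score in scores]  # PPOTrainer expects one scalar tensor per sample

    # 3. PPO updates the policy against those rewards
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```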
> We initialize the value function to the parameters of the reward model. In our experiments, the reward model, policy, and value function are the same size.

Indeed, in our...
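For reference, a minimal sketch of how TRL attaches a per-token value head to the policy model; whether that head is initialized from the reward model, as in the quoted paper, is a separate design choice:

```python
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")

inputs = tokenizer("PPO needs a value estimate for every generated token.", return_tensors="pt")
lm_logits, _, values = model(**inputs)
print(lm_logits.shape)  # (batch, seq_len, vocab_size): policy logits
print(values.shape)     # (batch, seq_len): per-token value estimates from the value head
```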
As far as I can tell from the paper, the entropy bonus is optional and not used in the experiments (see section 6.1). To prevent the model from generating gibberish...
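For concreteness, a small sketch of what such an (optional) entropy bonus term would look like if you did add one; the `entropy_bonus` helper, coefficient, and shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def entropy_bonus(logits: torch.Tensor, coef: float = 0.01) -> torch.Tensor:
    """logits: (batch, seq_len, vocab). Returns coef * mean per-token entropy of the policy."""
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)  # (batch, seq_len)
    return coef * entropy.mean()

logits = torch.randn(2, 8, 50257)  # fake policy logits, just to show the call
print(entropy_bonus(logits))  # added to the PPO objective (i.e. subtracted from the loss)
```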
Hard to know what could be the issue without a minimal example and some logs. Can you share a bit more?
Closing this for now, feel free to re-open if there's an update.
Indeed, `total_ppo_epochs` in the config is deprecated. If you want to train for multiple epochs it's best to add an additional for-loop around the dataloader:

```python
for epoch in range(num_epochs):  # num_epochs: however many passes over the data you want
    for batch in dataloader:
        # run the usual PPO step (generate, score, ppo_trainer.step) on each batch
        ...
```
Sounds really cool! Have you been able to test it already? If you have a working example then we can add it as an example! This might also be interesting...
Closing this for now - feel free to reopen if there's an update!
Closing this for now, feel free to reopen if there's an update.