Leandro von Werra
The reward model has to be trained before the RL/PPO loop. That's why it's not part of the trainer. However, there is a script to train a reward model in...
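For context, here is a minimal sketch of the pairwise loss such a reward-model training script optimizes before the PPO loop ever starts; the base model, example texts, and the `pairwise_reward_loss` helper are illustrative assumptions, not the exact script:

```python
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
reward_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
reward_model.config.pad_token_id = tokenizer.pad_token_id

def pairwise_reward_loss(chosen_texts, rejected_texts):
    """Score preferred and rejected completions; push chosen scores above rejected ones."""
    chosen = tokenizer(chosen_texts, return_tensors="pt", padding=True, truncation=True)
    rejected = tokenizer(rejected_texts, return_tensors="pt", padding=True, truncation=True)
    r_chosen = reward_model(**chosen).logits.squeeze(-1)      # (batch,)
    r_rejected = reward_model(**rejected).logits.squeeze(-1)  # (batch,)
    # -log sigmoid(r_chosen - r_rejected) is minimized when chosen outranks rejected
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy preference pair (hypothetical data, just to show the call)
loss = pairwise_reward_loss(
    ["Question: 2+2? Answer: 4"],
    ["Question: 2+2? Answer: banana"],
)
loss.backward()  # an optimizer step on the reward model's parameters would follow
```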
The reward model is not part of the PPO definition. PPO assumes an environment that emits a reward based on actions. In RLHF we simulate the human preference with a...
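To illustrate, here is a rough sketch of how the trained reward model stands in for the environment reward inside the loop, using the older TRL `PPOTrainer` API; `ppo_trainer`, `reward_model`, `tokenizer`, and `dataloader` are assumed to be set up as in the TRL examples:

```python
import torch

generation_kwargs = {"max_new_tokens": 32, "do_sample": True, "pad_token_id": tokenizer.eos_token_id}

for batch in dataloader:
    query_tensors = batch["input_ids"]  # list of 1-D token tensors, one per prompt

    # 1. the policy "acts": sample a response for each query
    response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, **generation_kwargs)
    texts = [q + r for q, r in zip(batch["query"], tokenizer.batch_decode(response_tensors))]

    # 2. the "environment" responds: the reward model scores each (query, response) pair
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        scores = reward_model(**inputs).logits.squeeze(-1)
    rewards = [score for score in scores]  # PPOTrainer expects one scalar tensor per sample

    # 3. PPO updates the policy against those rewards
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```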
> We initialize the value function to the parameters of the reward model. In our experiments, the reward model, policy, and value function are the same size.

Indeed, in our...
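For reference, a minimal sketch of how TRL attaches a per-token value head to the policy model; whether that head is initialized from the reward model, as in the quoted paper, is a separate design choice:

```python
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")

inputs = tokenizer("PPO needs a value estimate for every generated token.", return_tensors="pt")
lm_logits, _, values = model(**inputs)
print(lm_logits.shape)  # (batch, seq_len, vocab_size): policy logits
print(values.shape)     # (batch, seq_len): per-token value estimates from the value head
```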
As far as I can tell from the paper, the entropy bonus is optional and not used in the experiments (see section 6.1). To prevent the model from generating gibberish...
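For concreteness, a small sketch of what such an (optional) entropy bonus term would look like if you did add one; the `entropy_bonus` helper, coefficient, and shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def entropy_bonus(logits: torch.Tensor, coef: float = 0.01) -> torch.Tensor:
    """logits: (batch, seq_len, vocab). Returns coef * mean per-token entropy of the policy."""
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)  # (batch, seq_len)
    return coef * entropy.mean()

logits = torch.randn(2, 8, 50257)  # fake policy logits, just to show the call
print(entropy_bonus(logits))  # added to the PPO objective (i.e. subtracted from the loss)
```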
Hard to know what could be the issue without a minimal example and some logs. Can you share a bit more?
Closing this for now, feel free to re-open if there's an update.
Indeed, `total_ppo_epochs` in the config is deprecated. If you want to train for multiple epochs it's best to add an additional for-loop around the dataloader:

```python
for epoch in range(num_epochs):  # num_epochs: however many passes over the data you want
    for batch in dataloader:
        # run the usual PPO step (generate, score, ppo_trainer.step) on each batch
        ...
```
Sounds really cool! Have you been able to test it already? If you have a working example then we can add it as an example! This might also be interesting...
Closing this for now - feel free to reopen if there's an update!
Closing this for now, feel free to reopen if there's an update.