trl

Reward model

Open Jsjgjhg opened this issue 1 year ago • 5 comments

I have a question about selecting a reward model. The loss curve and the accuracy on the test set are useful indicators, but with different training parameters I can reach equally low loss and equally good accuracy. How do I choose the reward model checkpoint that is most favorable for PPO training?

Jsjgjhg avatar Sep 11 '24 12:09 Jsjgjhg

Is your question about the metrics to use?

qgallouedec avatar Sep 15 '24 08:09 qgallouedec

Yes. How can I measure whether a reward model is beneficial for PPO training?

Jsjgjhg avatar Sep 18 '24 02:09 Jsjgjhg

Why is the ppo/policy/loss metric always negative? For example: Uploading 1726630796123.png…

Jsjgjhg avatar Sep 18 '24 03:09 Jsjgjhg
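
On the negative policy loss question above: a negative value is not necessarily a problem. The sketch below is a minimal pure-Python version of the standard PPO clipped surrogate loss, not the library's actual code. The function name, the toy advantages, and the probability ratios are all made up for illustration, and it assumes advantages are normalized to zero mean, which is a common choice. Under that assumption the loss starts near zero when the ratios are all 1, and drops below zero as soon as the policy raises the probability of positive-advantage actions and lowers it for negative-advantage ones.

```python
def clipped_policy_loss(advantages, ratios, clip_range=0.2):
    """Mean PPO clipped surrogate loss (the quantity being minimized).

    advantages: per-sample advantage estimates (hypothetical toy values here)
    ratios: per-sample probability ratios pi_new(a|s) / pi_old(a|s)
    """
    losses = []
    for adv, ratio in zip(advantages, ratios):
        # Clamp the ratio into [1 - clip_range, 1 + clip_range].
        clipped_ratio = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
        unclipped_loss = -adv * ratio
        clipped_loss = -adv * clipped_ratio
        # Taking the max of the two losses is the pessimistic (clipped) bound.
        losses.append(max(unclipped_loss, clipped_loss))
    return sum(losses) / len(losses)


# Toy, zero-mean advantages (assumed to be normalized already).
advantages = [1.0, -1.0, 0.5, -0.5]

# Before any update the new and old policies agree, so every ratio is 1.
loss_before = clipped_policy_loss(advantages, [1.0, 1.0, 1.0, 1.0])

# After a helpful update: ratios go up where the advantage is positive
# and down where it is negative.
loss_after = clipped_policy_loss(advantages, [1.1, 0.9, 1.05, 0.95])

print(loss_before)  # approximately zero
print(loss_after)   # negative
```

So the sign of this metric mostly reflects how the surrogate objective is defined. The reward and KL curves are usually more informative for judging whether training is progressing.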

I would appreciate your guidance. Thank you.

Jsjgjhg avatar Sep 18 '24 03:09 Jsjgjhg

Reward model evaluation is a rather broad question. I suggest you take a look at https://huggingface.co/papers/2403.13787

qgallouedec avatar Oct 21 '24 07:10 qgallouedec
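
As a small starting point for the evaluation question in this thread, here is a minimal sketch of comparing two reward model checkpoints on held-out preference pairs. It is an illustration only: the function name, the checkpoint labels, and all scores are hypothetical, and it assumes you have already scored the chosen and rejected response of each pair with each checkpoint. It shows that two checkpoints can have identical pairwise accuracy while differing in mean reward margin and pairwise loss, so looking at more than one number may help separate them.

```python
import math


def pairwise_metrics(chosen_scores, rejected_scores):
    """Summarize reward scores on (chosen, rejected) preference pairs."""
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("Need one chosen and one rejected score per pair.")
    margins = [c - r for c, r in zip(chosen_scores, rejected_scores)]
    n = len(margins)
    # Fraction of pairs where the chosen response gets the higher reward.
    accuracy = sum(1 for m in margins if m > 0) / n
    # Average gap between chosen and rejected rewards.
    mean_margin = sum(margins) / n
    # Bradley-Terry style pairwise loss: -log(sigmoid(margin)).
    pairwise_loss = sum(math.log1p(math.exp(-m)) for m in margins) / n
    return {
        "accuracy": accuracy,
        "mean_margin": mean_margin,
        "pairwise_loss": pairwise_loss,
    }


# Hypothetical scores from two checkpoints on the same four pairs.
checkpoint_a = pairwise_metrics(
    chosen_scores=[2.0, 1.5, 0.2, -0.1],
    rejected_scores=[1.0, 0.5, 0.4, -0.3],
)
checkpoint_b = pairwise_metrics(
    chosen_scores=[0.3, 0.2, 0.1, 0.0],
    rejected_scores=[0.1, 0.1, 0.2, -0.1],
)

for name, metrics in [("A", checkpoint_a), ("B", checkpoint_b)]:
    print(name, metrics)
```

In this toy data both checkpoints rank three of the four pairs correctly, but checkpoint A separates chosen from rejected responses by a wider margin. Whether a wider margin actually helps PPO depends on the setup, so treat this as one diagnostic among several rather than a selection rule.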

Thank you.

Jsjgjhg avatar Nov 06 '24 07:11 Jsjgjhg