Reward model
I have a question about selecting reward models. The loss curve and the accuracy on the test set are useful indicators, but under different training parameters I can reach comparable loss and accuracy. How can I choose the most favorable reward model checkpoint for PPO training?
Is your question about the metrics to use?
Yes. How can I measure whether a reward model is beneficial for PPO training?
Also, why is the ppo/policy/loss metric always negative? Like this:
I hope to receive your guidance. Thank you.
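On the sign of ppo/policy/loss: a negative value is not necessarily a sign of a bug. The sketch below assumes the metric is the standard PPO clipped surrogate written as a loss to minimize, and that advantages are normalized to zero mean; neither assumption is confirmed in this thread. The function name `clipped_policy_loss` and all numbers are made up for illustration.

```python
def clipped_policy_loss(ratios, advantages, clip_range=0.2):
    """Mean PPO clipped surrogate, written as a loss to minimize."""
    losses = []
    for r, a in zip(ratios, advantages):
        clipped_r = min(max(r, 1.0 - clip_range), 1.0 + clip_range)
        # Negated objective: keep the more pessimistic (larger) loss term.
        losses.append(max(-a * r, -a * clipped_r))
    return sum(losses) / len(losses)


# Toy zero-mean advantages (assumption: the trainer normalizes them).
advantages = [1.0, -1.0, 0.5, -0.5]

# Before any update the probability ratio is 1 everywhere,
# so the loss equals -mean(advantages), which is 0 here.
loss_start = clipped_policy_loss([1.0] * 4, advantages)

# After an update the ratio tends to rise where the advantage is
# positive and fall where it is negative, pushing the loss below 0.
loss_after = clipped_policy_loss([1.1, 0.9, 1.05, 0.95], advantages)

print(loss_start, loss_after)
```

Under these assumptions the loss starts near zero and becomes negative as soon as the policy moves in the direction the advantages suggest, so its sign alone says little about training health.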
Reward model evaluation is a rather broad question. I suggest you take a look at https://huggingface.co/papers/2403.13787
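As one possible starting point, two checkpoints with the same accuracy can still be told apart by other statistics on the same held-out preference pairs. The following is a minimal sketch, not a recommended procedure: `pairwise_metrics` is a hypothetical helper, the scores are invented, and whether any of these numbers predicts PPO quality is an open question here.

```python
import math


def pairwise_metrics(chosen_scores, rejected_scores):
    """Summarize reward scores on (chosen, rejected) preference pairs."""
    margins = [c - r for c, r in zip(chosen_scores, rejected_scores)]
    n = len(margins)
    # Fraction of pairs where the chosen response scores higher.
    accuracy = sum(m > 0 for m in margins) / n
    # Pairwise logistic loss: -log(sigmoid(margin)) = log(1 + exp(-margin)).
    loss = sum(math.log1p(math.exp(-m)) for m in margins) / n
    mean_margin = sum(margins) / n
    return {"accuracy": accuracy, "loss": loss, "mean_margin": mean_margin}


# Invented scores from two checkpoints on the same four held-out pairs.
ckpt_a = pairwise_metrics([2.0, 1.0, 0.5, 0.2], [1.0, 0.5, 0.8, 0.1])
ckpt_b = pairwise_metrics([4.0, 3.0, 0.1, 0.3], [1.0, 0.5, 0.9, 0.2])

for name, metrics in (("ckpt_a", ckpt_a), ("ckpt_b", ckpt_b)):
    print(name, metrics)
```

In this toy data both checkpoints rank three of four pairs correctly, yet their score margins differ a lot, which changes the scale of the reward signal PPO would see. Reporting such metrics per data subset, rather than one aggregate number, may also help separate otherwise similar checkpoints.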
Thank you.