
Support Generative Reward Model (GenRM)

Open maksimstw opened this issue 9 months ago • 8 comments

According to the documentation, veRL only supports reward models loaded via AutoModelForSequenceClassification. What would be the best way to implement a generative reward model (GenRM) in veRL? I tried looking at the FSDP Workers and Megatron-LM Workers docs, but those pages no longer exist.

maksimstw avatar Feb 09 '25 00:02 maksimstw

Could you give an example of Generative RM and how it should be used?

vermouth1992 avatar Feb 09 '25 02:02 vermouth1992

Instead of using a linear predictor, GenRM leverages CoT and next-token prediction to provide the reward. GenRM has been shown to be more accurate: https://arxiv.org/abs/2410.12832
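To make the contrast concrete with toy numbers (these are illustrative, not from the paper): a classifier RM maps the final hidden state through a linear head to a scalar, whereas a GenRM can first reason in text and then emit a verdict token, with the reward taken as the probability of that token. A minimal sketch, assuming a hypothetical judge whose final-step logits over a ["Yes", "No"] verdict vocabulary are given:

```python
# Toy illustration of GenRM-style scoring: reward = P("Yes" | prompt, CoT),
# read off the judge's next-token distribution at the verdict position.
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Suppose the judge's final-step logits over ["Yes", "No"] are:
yes_logit, no_logit = 2.0, 0.5
p_yes = softmax([yes_logit, no_logit])[0]  # the reward signal
```

The classifier RM produces one scalar directly; the GenRM spends extra tokens on chain-of-thought before committing to the verdict, which is where the accuracy gains reported in the paper come from.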

maksimstw avatar Feb 09 '25 03:02 maksimstw

> Instead of using a linear predictor, GenRM leverages CoT and next-token prediction to provide reward. GenRM is proven to be more accurate. https://arxiv.org/abs/2410.12832

Are there any pretrained GenRM that we can play around with?

vermouth1992 avatar Feb 09 '25 03:02 vermouth1992

+1, looking forward to such a feature.

darkpromise98 avatar Feb 09 '25 14:02 darkpromise98

> > Instead of using a linear predictor, GenRM leverages CoT and next-token prediction to provide reward. GenRM is proven to be more accurate. https://arxiv.org/abs/2410.12832
>
> Are there any pretrained GenRM that we can play around with?

I’m not aware of any existing pretrained GenRM. However, a basic GenRM could be created by simply prompting an LLM to act as a judge. For example, a prompt like this could be used:

Please act as a judge and provide a score from 1 to 5, with 5 being the highest quality. Think step by step before deciding on a score, and output the final score in \box{}.

I’m happy to help implement this feature, but I could use some guidance on the best place to integrate this reward worker.
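For what it's worth, a reward worker built on that prompt mainly needs to turn the judge's free-form output into a scalar. A minimal sketch (all function names here are hypothetical; the only assumption carried over from the prompt above is that the judge puts its final 1-5 score in \box{}):

```python
# Hypothetical sketch: parse a 1-5 score from a judge model's CoT output
# and normalize it to a reward in [0, 1].
import re

JUDGE_PROMPT = (
    "Please act as a judge and provide a score from 1 to 5, with 5 being "
    "the highest quality. Think step by step before deciding on a score, "
    "and output the final score in \\box{}."
)

def extract_score(judge_output: str, default: float = 0.0) -> float:
    """Return the last \\box{N} score mapped to [0, 1], or `default`."""
    matches = re.findall(r"\\box\{\s*([1-5])\s*\}", judge_output)
    if not matches:
        return default
    return (int(matches[-1]) - 1) / 4.0  # map 1..5 -> 0..1
```

Taking the last match guards against the judge mentioning candidate scores during its step-by-step reasoning before the final verdict, and the default keeps training alive when the judge omits the boxed score.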

maksimstw avatar Feb 09 '25 18:02 maksimstw

> > > Instead of using a linear predictor, GenRM leverages CoT and next-token prediction to provide reward. GenRM is proven to be more accurate. https://arxiv.org/abs/2410.12832
> >
> > Are there any pretrained GenRM that we can play around with?
>
> I’m not aware of any existing pretrained GenRM. However, a basic GenRM could be created by simply prompting an LLM to act as a judge. For example, a prompt like this could be used:
>
> Please act as a judge and provide a score from 1 to 5, with 5 being the highest quality. Think step by step before deciding on a score, and output the final score in \box{}.
>
> I’m happy to help implement this feature, but I could use some guidance on the best place to integrate this reward worker.

Hi, could you share some best practices for using this kind of generative reward? That would be greatly appreciated.

zsychina avatar Feb 28 '25 18:02 zsychina

Looking forward to this feature too

Charles20021201 avatar Mar 05 '25 03:03 Charles20021201

+1

shizhediao avatar Mar 06 '25 07:03 shizhediao

Looking forward, thumbs up

jijivski avatar Mar 14 '25 01:03 jijivski

+1. Need it. Does anyone have any knowledge of this?

Cakeyan avatar Apr 28 '25 03:04 Cakeyan

+1. Need it

Wolfwjs avatar May 08 '25 06:05 Wolfwjs

+1. Need it

ghost avatar May 13 '25 12:05 ghost

+1. Need it

JoshonSmith avatar Jun 09 '25 04:06 JoshonSmith

+1. Need it

jianhai0527 avatar Jun 10 '25 03:06 jianhai0527

Here's a naive implementation of LLM-as-a-Judge that runs as a new role in RL training. We use a Qwen model as the verifier; it can be replaced with any pretrained GenRM you like: https://github.com/volcengine/verl/pull/1953
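For readers who just want the plumbing, a lighter-weight alternative to a dedicated role is verl's custom reward function hook. Below is a hedged sketch, not the PR's actual code: it assumes verl's documented `compute_score(data_source, solution_str, ground_truth, extra_info)` signature, and `judge_fn` is a hypothetical stand-in for a generation request to the verifier model (e.g. a Qwen judge behind an OpenAI-compatible server):

```python
# Hypothetical sketch of wiring a generative judge into a verl-style
# custom reward function. The stub judge only demonstrates the plumbing;
# a real `judge_fn` would issue a generation request to the verifier.
import re

def compute_score(data_source, solution_str, ground_truth, extra_info=None,
                  judge_fn=None):
    """Score `solution_str` with a generative judge; fall back to 0.0."""
    if judge_fn is None:  # stub judge so the sketch is self-contained
        judge_fn = lambda prompt: "Matches the reference. \\box{5}"
    prompt = (
        f"Reference answer:\n{ground_truth}\n\n"
        f"Model answer:\n{solution_str}\n\n"
        "Score the model answer from 1 to 5, think step by step, "
        "and put the final score in \\box{}."
    )
    scores = re.findall(r"\\box\{\s*([1-5])\s*\}", judge_fn(prompt))
    return (int(scores[-1]) - 1) / 4.0 if scores else 0.0
```

The fallback to 0.0 when no boxed score is found keeps a malformed judge response from crashing a training run; whether 0.0 is the right fallback (versus, say, resampling the judge) is a design choice.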

llm-player-01 avatar Jun 10 '25 18:06 llm-player-01

https://github.com/volcengine/verl/tree/main/recipe/genrm_remote

eric-haibin-lin avatar Jul 24 '25 20:07 eric-haibin-lin