
Support Generative Reward Model (GenRM)

Open maksimstw opened this issue 9 months ago • 8 comments

According to the documentation, veRL only supports reward models loaded via AutoModelForSequenceClassification. What would be the best way to implement a generative reward model (GenRM) in veRL? I tried looking at the FSDP Workers and Megatron-LM Workers docs, but those pages no longer exist.

maksimstw avatar Feb 09 '25 00:02 maksimstw

Could you give an example of Generative RM and how it should be used?

vermouth1992 avatar Feb 09 '25 02:02 vermouth1992

Instead of using a linear predictor, GenRM leverages CoT and next-token prediction to provide the reward. GenRM has been shown to be more accurate: https://arxiv.org/abs/2410.12832
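To make the contrast concrete with toy numbers (these are illustrative, not from the paper): a classifier RM maps the final hidden state through a linear head to a scalar, whereas a GenRM can first reason in text and then emit a verdict token, with the reward taken as the probability of that token. A minimal sketch, assuming a hypothetical judge whose final-step logits over a ["Yes", "No"] verdict vocabulary are given:

```python
# Toy illustration of GenRM-style scoring: reward = P("Yes" | prompt, CoT),
# read off the judge's next-token distribution at the verdict position.
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Suppose the judge's final-step logits over ["Yes", "No"] are:
yes_logit, no_logit = 2.0, 0.5
p_yes = softmax([yes_logit, no_logit])[0]  # the reward signal
```

The classifier RM produces one scalar directly; the GenRM spends extra tokens on chain-of-thought before committing to the verdict, which is where the accuracy gains reported in the paper come from.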

maksimstw avatar Feb 09 '25 03:02 maksimstw

> Instead of using a linear predictor, GenRM leverages CoT and next-token prediction to provide reward. GenRM is proven to be more accurate. https://arxiv.org/abs/2410.12832

Are there any pretrained GenRM that we can play around with?

vermouth1992 avatar Feb 09 '25 03:02 vermouth1992

+1, looking forward to such a feature.

darkpromise98 avatar Feb 09 '25 14:02 darkpromise98

> > Instead of using a linear predictor, GenRM leverages CoT and next-token prediction to provide reward. GenRM is proven to be more accurate. https://arxiv.org/abs/2410.12832
>
> Are there any pretrained GenRM that we can play around with?

I’m not aware of any existing pretrained GenRM. However, a basic GenRM could be created by simply prompting an LLM to act as a judge. For example, a prompt like this could be used:

Please act as a judge and provide a score from 1 to 5, with 5 being the highest quality. Think step by step before deciding on a score, and output the final score in \box{}.

I’m happy to help implement this feature, but I could use some guidance on the best place to integrate this reward worker.
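For what it's worth, a reward worker built on that prompt mainly needs to turn the judge's free-form output into a scalar. A minimal sketch (all function names here are hypothetical; the only assumption carried over from the prompt above is that the judge puts its final 1-5 score in \box{}):

```python
# Hypothetical sketch: parse a 1-5 score from a judge model's CoT output
# and normalize it to a reward in [0, 1].
import re

JUDGE_PROMPT = (
    "Please act as a judge and provide a score from 1 to 5, with 5 being "
    "the highest quality. Think step by step before deciding on a score, "
    "and output the final score in \\box{}."
)

def extract_score(judge_output: str, default: float = 0.0) -> float:
    """Return the last \\box{N} score mapped to [0, 1], or `default`."""
    matches = re.findall(r"\\box\{\s*([1-5])\s*\}", judge_output)
    if not matches:
        return default
    return (int(matches[-1]) - 1) / 4.0  # map 1..5 -> 0..1
```

Taking the last match guards against the judge mentioning candidate scores during its step-by-step reasoning before the final verdict, and the default keeps training alive when the judge omits the boxed score.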

maksimstw avatar Feb 09 '25 18:02 maksimstw

> > > Instead of using a linear predictor, GenRM leverages CoT and next-token prediction to provide reward. GenRM is proven to be more accurate. https://arxiv.org/abs/2410.12832
> >
> > Are there any pretrained GenRM that we can play around with?
>
> I’m not aware of any existing pretrained GenRM. However, a basic GenRM could be created by simply prompting an LLM to act as a judge. For example, a prompt like this could be used:
>
> Please act as a judge and provide a score from 1 to 5, with 5 being the highest quality. Think step by step before deciding on a score, and output the final score in \box{}.
>
> I’m happy to help implement this feature, but I could use some guidance on the best place to integrate this reward worker.

Hi, could you share some best practices for using this kind of generative reward? That would be greatly appreciated.

zsychina avatar Feb 28 '25 18:02 zsychina

Looking forward to this feature too

Charles20021201 avatar Mar 05 '25 03:03 Charles20021201

+1

shizhediao avatar Mar 06 '25 07:03 shizhediao

Looking forward, thumbs up

jijivski avatar Mar 14 '25 01:03 jijivski

+1. Need it. Does anyone have any knowledge of this?

Cakeyan avatar Apr 28 '25 03:04 Cakeyan

+1. Need it

Wolfwjs avatar May 08 '25 06:05 Wolfwjs

+1. Need it

ghost avatar May 13 '25 12:05 ghost

+1. Need it

JoshonSmith avatar Jun 09 '25 04:06 JoshonSmith

+1. Need it

jianhai0527 avatar Jun 10 '25 03:06 jianhai0527

Here's a naive implementation of LLM-as-a-Judge that runs as a new role in RL training. We use a Qwen model as the verifier; it can be replaced with any pretrained GenRM you like: https://github.com/volcengine/verl/pull/1953
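For readers who just want the plumbing, a lighter-weight alternative to a dedicated role is verl's custom reward function hook. Below is a hedged sketch, not the PR's actual code: it assumes verl's documented `compute_score(data_source, solution_str, ground_truth, extra_info)` signature, and `judge_fn` is a hypothetical stand-in for a generation request to the verifier model (e.g. a Qwen judge behind an OpenAI-compatible server):

```python
# Hypothetical sketch of wiring a generative judge into a verl-style
# custom reward function. The stub judge only demonstrates the plumbing;
# a real `judge_fn` would issue a generation request to the verifier.
import re

def compute_score(data_source, solution_str, ground_truth, extra_info=None,
                  judge_fn=None):
    """Score `solution_str` with a generative judge; fall back to 0.0."""
    if judge_fn is None:  # stub judge so the sketch is self-contained
        judge_fn = lambda prompt: "Matches the reference. \\box{5}"
    prompt = (
        f"Reference answer:\n{ground_truth}\n\n"
        f"Model answer:\n{solution_str}\n\n"
        "Score the model answer from 1 to 5, think step by step, "
        "and put the final score in \\box{}."
    )
    scores = re.findall(r"\\box\{\s*([1-5])\s*\}", judge_fn(prompt))
    return (int(scores[-1]) - 1) / 4.0 if scores else 0.0
```

The fallback to 0.0 when no boxed score is found keeps a malformed judge response from crashing a training run; whether 0.0 is the right fallback (versus, say, resampling the judge) is a design choice.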

llm-player-01 avatar Jun 10 '25 18:06 llm-player-01

https://github.com/volcengine/verl/tree/main/recipe/genrm_remote

eric-haibin-lin avatar Jul 24 '25 20:07 eric-haibin-lin