Online-RLHF

Negative reward when serving ArmoRM-Llama3-8B-v0.1

Open maoliyuan opened this issue 1 year ago • 4 comments

Hello! When I serve ArmoRM-Llama3-8B-v0.1 using OpenRLHF, the output rewards are almost all negative (around -2.0). I've attached some screenshots of how I served the reward model. Is the output of this RM naturally around -2.0, or is the way I serve the RM wrong? (The prompt dataset is also from RLHFlow, e.g. "RLHFlow/iterative-prompt-v1-iter7-20K", and the responses are generated with "RLHFlow/LLaMA3-iterative-DPO-final". We also apply the chat template when creating the prompt-response dataset.)

[Screenshots: serve-armo-reward-model, serve-armo-reward-model1, serve-armo-reward-model2]

maoliyuan avatar Aug 29 '24 12:08 maoliyuan

Could you try the service example at https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1?
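For reference, the usage on the model card is roughly along these lines (a minimal sketch; the custom output attributes such as `score` come from the repository's remote code, so verify the exact names against the model card):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda"
path = "RLHFlow/ArmoRM-Llama3-8B-v0.1"

# trust_remote_code=True is needed: the reward head and gating network live in
# the repository's custom modeling code, not in a stock transformers class.
model = AutoModelForSequenceClassification.from_pretrained(
    path, device_map=device, trust_remote_code=True, torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(path, use_fast=True)

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
# Apply the chat template so the input matches the format the RM was trained on.
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)

with torch.no_grad():
    output = model(input_ids)
    # Scalar preference score aggregated from the multi-objective rewards by the
    # gating layer; this is the value typically used as the RLHF reward.
    preference_score = output.score.cpu().float()
print(preference_score)
```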

WeiXiongUST avatar Aug 30 '24 16:08 WeiXiongUST

Perhaps I found the reason... I built the model with AutoModel.from_pretrained rather than AutoModelForSequenceClassification.from_pretrained. When I tried the example you gave, the model produces output like this: [screenshot: armo-rm-custom-output]. That is a correct output and has everything I want. However, when I build the model with AutoModel.from_pretrained, the output becomes something like this: [screenshot: armo-rm-auto-output]. Could you please explain the reason behind this? Thanks a lot.

maoliyuan avatar Sep 02 '24 09:09 maoliyuan

You may want to check the Hugging Face documentation on the difference between AutoModel and the task-specific auto classes such as AutoModelForSequenceClassification.
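Roughly speaking, AutoModel resolves to the bare backbone for the config's model type, while AutoModelForSequenceClassification (with trust_remote_code) resolves to the custom reward-model class registered by this repository. A small sketch to inspect what each would resolve to (illustrative; the exact class names depend on the repository's config and auto_map):

```python
from transformers import AutoConfig

path = "RLHFlow/ArmoRM-Llama3-8B-v0.1"
config = AutoConfig.from_pretrained(path, trust_remote_code=True)

print(config.architectures)                # class the checkpoint was saved with
print(getattr(config, "auto_map", None))   # custom-code mapping, if any

# AutoModel picks the *base* model for the config's model_type (a plain Llama
# backbone here), so the reward head and gating network are never attached and
# the forward pass returns hidden states rather than rewards.
# AutoModelForSequenceClassification, with trust_remote_code=True, loads the
# repository's custom reward-model class, which attaches those heads and
# returns the reward outputs shown in the model card example.
```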

WeiXiongUST avatar Sep 02 '24 19:09 WeiXiongUST

Thanks a lot! By the way, could you please provide an example that runs inference on a batch of inputs and takes an attention mask as input? The example you provided on Hugging Face only covers inference for a single input.
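Something like the following sketch is what I have in mind, assuming the custom forward accepts an attention_mask and reads the reward at the last non-padding token, and that the chat template already inserts the special tokens. Does this look right?

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda"
path = "RLHFlow/ArmoRM-Llama3-8B-v0.1"
model = AutoModelForSequenceClassification.from_pretrained(
    path, device_map=device, trust_remote_code=True, torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(path, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Llama-3 tokenizers ship without a pad token

conversations = [
    [{"role": "user", "content": "What is 2 + 2?"},
     {"role": "assistant", "content": "2 + 2 equals 4."}],
    [{"role": "user", "content": "Name a primary color."},
     {"role": "assistant", "content": "Red is a primary color."}],
]

# Apply the chat template per conversation as text, then tokenize with padding so
# the batch can be stacked; add_special_tokens=False avoids adding a second BOS
# on top of the one the chat template already inserts.
texts = [tokenizer.apply_chat_template(m, tokenize=False) for m in conversations]
batch = tokenizer(
    texts, return_tensors="pt", padding=True, add_special_tokens=False
).to(device)

with torch.no_grad():
    output = model(
        input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
    )
    # Assumed: one preference score per example; padding handling inside the
    # custom forward should be double-checked against the repository's code.
    scores = output.score.cpu().float()
print(scores)
```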

maoliyuan avatar Sep 03 '24 06:09 maoliyuan