llm_optimization

alpacafarm reward-model-human as gold reward

Open georgao35 opened this issue 8 months ago • 4 comments

Hello, I have found your work and code extremely helpful, thank you! While using the code, however, I found several points confusing and am not sure whether I am doing things correctly, so I hope you can help.

In particular, I am unsure about two points regarding the use of Alpacafarm/reward-model-human as the gold reward model, as in the paper:

  1. After PPO training, when using Alpacafarm/reward-model-human to assign the gold reward to generated responses, I found that I have to set is_alpaca_rm: true instead of the false that is originally set in configs/config_rl.json.
  2. With the AlpacaFarm reward models enabled, I believe there is a typo in the function _parse_entry of src/data_utils/rm_dataset_formatter.py. The prompt it builds contains only the first line, "Below is an instruction that describes a task, paired with an ", instead of the whole prompt used in AlpacaFarm. I fixed it by adding a \ at the end of each line, but I am not sure whether that is the right fix.
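For illustration, here is a minimal sketch of the kind of bug described in point 2. It is a hypothetical reconstruction, not the actual code from src/data_utils/rm_dataset_formatter.py: the function names `build_prompt_broken` and `build_prompt_fixed` and the prompt text after the first line are assumptions. If the prompt is written as several adjacent string literals on separate lines with nothing joining them, Python assigns only the first literal and silently discards the rest.

```python
# Hypothetical reconstruction of the pitfall; NOT the repository's actual code.
# The prompt text after the first line is an assumed placeholder.


def build_prompt_broken(instruction: str) -> str:
    # Only the first literal is assigned to `prompt`. The next two lines are
    # separate expression statements that Python evaluates and throws away.
    prompt = "Below is an instruction that describes a task, paired with an "
    "input that provides further context. "
    f"### Instruction:\n{instruction}\n\n### Response:"
    return prompt


def build_prompt_fixed(instruction: str) -> str:
    # Wrapping the literals in parentheses makes them one expression, so
    # adjacent string literals are concatenated into a single prompt.
    prompt = (
        "Below is an instruction that describes a task, paired with an "
        "input that provides further context. "
        f"### Instruction:\n{instruction}\n\n### Response:"
    )
    return prompt


if __name__ == "__main__":
    print(repr(build_prompt_broken("Say hi.")))
    print(repr(build_prompt_fixed("Say hi.")))
```

Ending each line with a \ joins the lines as well, so that fix should give the same result under this assumption. The parenthesized form is usually preferred because it does not break if trailing whitespace appears after a backslash.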

I would deeply appreciate any help with these problems. Thank you!

georgao35, Jun 21 '24 06:06