llm_optimization
AlpacaFarm reward-model-human as gold reward
Hello, I have found your work and code extremely helpful, thank you! However, while using it I found several points confusing, and I am not sure I am doing things right, so I hope you can generously help me.
In particular, I am not sure about a few points regarding the use of AlpacaFarm/reward-model-human as the gold reward model, as in the paper:
- After PPO training, when using AlpacaFarm/reward-model-human to assign the gold reward for generated responses, I found I have to set `is_alpaca_rm: true` instead of `false`, which is what is originally set in `configs/config_rl.json`.
- In that case, when using the alpaca RMs, I believe there is a typo in the function `_parse_entry` of `src/data_utils/rm_dataset_formatter.py`. When using this function, the prompt only contains the first line, "Below is an instruction that describes a task, paired with an ", instead of the whole prompt used in AlpacaFarm. I fixed it by adding a `\` at the end of each line, but I am not sure whether that is the right way.
I would deeply appreciate any help with these problems. Thank you!