nebuly
Input/completion in reward training vs. RL training
Why is it that in reward training, the input and completion are joined as:
user_input + " " + completion
(reward.py line 254)
whereas in RL training, the equivalent task_response is:
input + "\n" + completion
(trainer.py line 680)?
Hi @menandro, thank you for reporting the mismatch! I think it's just a typo that we need to fix as soon as possible. Would you mind opening a PR to fix it? I think it makes much more sense to use the format from reward.py (a single space between the input and the completion) in both places.
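One way to keep the two code paths from drifting apart again might be to build the string in a single shared helper. The following is only a sketch: `build_sample` and `SEPARATOR` are hypothetical names that do not exist in the repository, and the only thing taken from this thread is the choice of a single space as the separator.

```python
# Hypothetical shared helper; the names below are illustrative and are
# not taken from reward.py or trainer.py.
SEPARATOR = " "  # the separator reward.py uses, per the question above


def build_sample(user_input: str, completion: str, sep: str = SEPARATOR) -> str:
    """Join a prompt and its completion with one shared separator."""
    return user_input + sep + completion


# Reward training and RL training would both call the same helper,
# so the text each model sees is built identically.
reward_text = build_sample("What is RLHF?", "A fine-tuning method.")
rl_text = build_sample("What is RLHF?", "A fine-tuning method.")
assert reward_text == rl_text
```

If both reward.py and trainer.py called a helper like this, changing the separator later would only require touching one place.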