lm-human-preferences
How can I free the active GPT-2 from the reference model constraint?
Hi,
As I understand it, a KL term in the loss constrains how far the active GPT-2 (the policy that produces responses scored for rewards) can drift from the original reference GPT-2. Which parameters can I tune to relax this constraint? I want to let the active GPT-2 deviate further from the reference model, because in my experiments the rewards do not improve as expected, possibly due to this constraint. I am new to PPO, so any suggestions would be appreciated.
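To make the question concrete, this is the kind of KL-penalized reward I mean (a minimal sketch in my own words; the function name is mine, and `kl_coef` is my label for the penalty coefficient, not necessarily the repo's exact API):

```python
import numpy as np

def penalized_rewards(reward_scores, logprobs_active, logprobs_ref, kl_coef):
    """Per-token reward = reward-model score minus a KL penalty that keeps
    the active policy close to the frozen reference policy.

    `logprobs_active - logprobs_ref` is the usual sample-based estimate of
    the per-token KL between the two policies.
    """
    kl_per_token = logprobs_active - logprobs_ref
    return reward_scores - kl_coef * kl_per_token

scores = np.array([1.0, 0.5])
lp_active = np.array([-1.0, -2.0])   # log-probs under the active GPT-2
lp_ref = np.array([-1.5, -2.0])      # log-probs under the reference GPT-2
# With kl_coef = 0.2, rewards shrink where the policies diverge;
# moving kl_coef toward 0 is what I mean by loosening the constraint.
r = penalized_rewards(scores, lp_active, lp_ref, kl_coef=0.2)
# → array([0.9, 0.5])
```

So my question amounts to: is lowering this coefficient (or whatever controls it, e.g. an adaptive KL target) the right way to let the policy deviate more?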
Thanks.