
How to free the active GPT-2 from the reference model constraint?

Open yanan1116 opened this issue 4 years ago • 0 comments

Hi,

We know that a KL term is used in the loss to constrain the difference between the original (reference) GPT-2 and the active GPT-2 that produces responses for reward feedback. How can I tune the parameters to relax this constraint? I want the active GPT-2 to be able to deviate further from the reference GPT-2, since in my experiments the rewards do not improve as expected, possibly because of this constraint. I am new to PPO, so any suggestions would be appreciated.
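For context, a minimal sketch of the mechanism in question: in KL-penalized PPO fine-tuning, the per-rollout objective is roughly `reward - beta * KL(active || reference)`, so the coefficient `beta` directly controls how tightly the active model is tied to the reference. The function and variable names below are illustrative, not the repo's actual API; lowering `beta` (the KL coefficient) is the usual way to loosen the constraint.

```python
import numpy as np

def kl_penalized_reward(reward, logprobs_active, logprobs_ref, beta):
    """KL-penalized reward for one sampled response.

    Uses the sample-based KL estimate: sum over tokens of
    log p_active(token) - log p_ref(token). A smaller beta means a
    weaker penalty, letting the active model drift further from the
    reference model. (Illustrative sketch, not the repo's exact code.)
    """
    kl_estimate = np.sum(logprobs_active - logprobs_ref)
    return reward - beta * kl_estimate

# Same rollout, two penalty strengths (values are made up for illustration)
lp_active = np.array([-1.0, -0.5, -0.8])  # active model's token log-probs
lp_ref = np.array([-1.5, -1.2, -1.0])     # reference model's token log-probs

tight = kl_penalized_reward(2.0, lp_active, lp_ref, beta=0.2)   # strong constraint
loose = kl_penalized_reward(2.0, lp_active, lp_ref, beta=0.02)  # weak constraint
```

Note that the codebase also supports an adaptive KL controller that adjusts the coefficient toward a target KL value during training, so if that mode is enabled, raising the KL target (rather than just lowering the initial coefficient) is what effectively relaxes the constraint.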

Thanks.

yanan1116 avatar Aug 03 '21 10:08 yanan1116