Weber Xie

8 comments by Weber Xie

Any news about this question? Looking forward to PyTorch support.

Thanks @da03! My env:
- CentOS 7
- CUDA 9.1
- PyTorch 1.2

I ran into the same problem. Could anyone on the team reply to this issue?

Thanks for your reply! So the reward model is not updated in the PPO training loop. Is this the standard process for the PPO algorithm? Thanks.

Thanks for your kind explanation! I understand that the reward model is static. Regarding the code implementation of TRLX's ppo_trainer, the policy function and value function are the same model, am I...

The paper "Learning to summarize from human feedback" mentions:
> We initialize the value function to the parameters of the reward model. In our experiments,...