Weber Xie
Any news on this question? Looking forward to PyTorch support.
Thanks @da03! My env:
- CentOS 7
- CUDA 9.1
- PyTorch 1.2
Same problem here; looking forward to a fix.
I met the same problem. Can anyone on the team reply to this issue?
Also looking forward to this feature.
Thanks for your reply! So the Reward Model will not be updated in the PPO training loop; is this the standard procedure in the PPO algorithm? Thanks.
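(For illustration only: a minimal PyTorch sketch of that setup, using toy `nn.Linear` stand-ins rather than real language models, showing the reward model frozen and excluded from the PPO optimizer. The module names and the placeholder loss are assumptions, not TRLX's actual code.)

```python
import torch
import torch.nn as nn

# Toy stand-ins (illustrative only; in RLHF these are language models).
hidden = 16
policy = nn.Linear(hidden, hidden)     # policy network being trained by PPO
value_head = nn.Linear(hidden, 1)      # value function
reward_model = nn.Linear(hidden, 1)    # previously trained reward model

# The reward model is frozen: no gradients, and not passed to the optimizer.
reward_model.eval()
for p in reward_model.parameters():
    p.requires_grad_(False)

# Only policy + value parameters are updated inside the PPO loop.
optimizer = torch.optim.Adam(
    list(policy.parameters()) + list(value_head.parameters()), lr=1e-4
)

for _ in range(3):  # stand-in for the PPO outer loop
    states = torch.randn(8, hidden)
    actions = policy(states)                  # rollout
    with torch.no_grad():
        rewards = reward_model(actions)       # reward model only scores samples
    values = value_head(actions)
    # Placeholder objective; real PPO uses a clipped surrogate + value loss.
    loss = ((values - rewards) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```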
Thanks for your kind explanation! I understand the Reward Model is static. Regarding the code implementation of TRLX's ppo_trainer, the policy function and value function are the same model, am I...
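(If it helps later readers, here is a minimal sketch of that shared policy/value pattern: one backbone with both an LM head and a value head, so the two functions share parameters. The class and layer names are made up for illustration and are not copied from TRLX's ppo_trainer.)

```python
import torch
import torch.nn as nn

class PolicyWithValueHead(nn.Module):
    """One backbone produces both the token logits (policy) and a scalar
    value estimate, i.e. the policy and value function share parameters."""

    def __init__(self, vocab_size=100, hidden=32):
        super().__init__()
        self.backbone = nn.Embedding(vocab_size, hidden)   # stand-in for an LM
        self.lm_head = nn.Linear(hidden, vocab_size)       # policy logits
        self.value_head = nn.Linear(hidden, 1)              # value estimate

    def forward(self, input_ids):
        h = self.backbone(input_ids)              # (batch, seq, hidden)
        logits = self.lm_head(h)                  # policy distribution per token
        values = self.value_head(h).squeeze(-1)   # per-token value estimates
        return logits, values

model = PolicyWithValueHead()
logits, values = model(torch.randint(0, 100, (2, 5)))
print(logits.shape, values.shape)  # torch.Size([2, 5, 100]) torch.Size([2, 5])
```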
From the paper *Learning to summarize from human feedback*, it mentions:

> We initialize the value function to the parameters of the reward model. In our experiments, ...
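(A minimal sketch of what that initialization can look like, assuming the value function uses the same architecture as the reward model; the variable names are illustrative and not taken from the paper's or TRLX's code.)

```python
import copy
import torch
import torch.nn as nn

hidden = 32
# Stand-in architecture; in the paper both are transformer LMs with a scalar head.
reward_model = nn.Sequential(
    nn.Linear(hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1)
)

# "We initialize the value function to the parameters of the reward model":
# start the value function as a copy of the trained reward model, then let PPO
# update the value function while the reward model itself stays frozen.
value_function = copy.deepcopy(reward_model)
# Equivalent explicit form: copy the weights via a state dict.
value_function.load_state_dict(reward_model.state_dict())

x = torch.randn(4, hidden)
assert torch.allclose(reward_model(x), value_function(x))  # identical at init
```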