liushixuan
> OK, I went back over this old project and can answer the question. > > Searching for the keyword `self.init_tilda_op` in uf/task/base.py shows that, within the whole training flow, this op actually runs only once as initialization at the start of training, not at the beginning of every epoch. The comment "runs at the start of each epoch" was my mistake; I will correct it to "runs at the start of training" in the next release. > > Thanks for catching this~ Thanks for the explanation~
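(For readers following along: a minimal sketch of the behaviour described above, assuming a TensorFlow 1.x style session loop; the `init_tilda_op` and `train_op` below are placeholders, not the actual ops built in uf/task/base.py.)

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Hypothetical placeholder ops standing in for the ones built in uf/task/base.py.
init_tilda_op = tf.no_op(name="init_tilda")
train_op = tf.no_op(name="train")

num_epochs = 3
num_batches = 10

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Runs exactly once, as part of training initialization ...
    sess.run(init_tilda_op)
    for epoch in range(num_epochs):
        # ... and is NOT re-run at the start of each epoch.
        for _ in range(num_batches):
            sess.run(train_op)
```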
Hi, do you have any plans to support pipeline parallelism now?
> Compute the response length; if the implementation doesn't match that expectation, it's a bug. Thanks for the explanation! One more question: when computing rrhf_loss and ft_loss over the response, should both the query and the padding be masked out, or only the query part?
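For reference, a minimal sketch of what masking out both the query and the padding looks like when averaging a per-token loss; the function and tensor names here are hypothetical and not taken from the repository.

```python
import torch

def masked_mean(per_token_loss: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """Average a per-token loss over response tokens only.

    per_token_loss: (batch, seq_len) loss at every position.
    response_mask:  (batch, seq_len) 1 for response tokens, 0 for query AND padding.
    """
    return (per_token_loss * response_mask).sum() / response_mask.sum().clamp(min=1)

# Hypothetical example: sequence = [query, query, response, response, pad]
per_token_loss = torch.tensor([[0.5, 0.7, 1.0, 2.0, 3.0]])
response_mask = torch.tensor([[0.0, 0.0, 1.0, 1.0, 0.0]])
print(masked_mean(per_token_loss, response_mask))  # averages only the two response tokens
```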
> > 1. Did you do experiments on this to see which one performs better? > > I have already tried specifying average_log_prob=True, but the beta value needs some adjustment....
> It seems that most of the code is copied from `train_ppo.py`. The difference between GRPO and PPO is just removing the value net and calculating the advantage only using...
> Hi, I want to confirm the implementation of `compute_approx_kl` in `openrlhf.models.utils`. The GRPO paper claims they use the unbiased estimator `pi_ref/pi - log(pi_ref/pi) - 1` from http://joschu.net/blog/kl-approx.html. I see the blog...
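For context, here is a small self-contained sketch of the k3 estimator from that blog post, written against hypothetical per-token log-probs rather than the actual signature of `compute_approx_kl`.

```python
import torch

def approx_kl_k3(log_probs: torch.Tensor, ref_log_probs: torch.Tensor) -> torch.Tensor:
    """Per-token k3 estimator of KL(pi || pi_ref), per http://joschu.net/blog/kl-approx.html.

    With r = pi_ref / pi evaluated on samples drawn from pi:
        k3 = r - log(r) - 1
    which is unbiased (E[r - 1] = 0) and always non-negative.
    """
    log_ratio = ref_log_probs - log_probs          # log(pi_ref / pi)
    return torch.exp(log_ratio) - log_ratio - 1.0  # pi_ref/pi - log(pi_ref/pi) - 1

# Hypothetical per-token log-probs under the policy and the reference model.
log_probs = torch.tensor([-1.2, -0.8, -2.0])
ref_log_probs = torch.tensor([-1.0, -1.1, -1.9])
print(approx_kl_k3(log_probs, ref_log_probs))
```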
Hi, I updated the code to reduce its redundancy. The main ideas of this update are as follows:
* update ```train_grpo_ray.py``` to support ray and vllm for GRPO.
* update ```Experience```...
> @LSX-Sneakerprogrammer Hi, after some offline discussion, it seems that PPO and GRPO can share a lot of code, and we have been working on preparing for such a merge recently, by...
> We'll try to merge #466 this week, so both will be fine. > > BTW, it seems to me that we no longer need a separate `GRPOTrainer` or `ActorGRPOTrainer`...
> > computing GRPO advantages requires normalizing the rewards across responses sampled from the same prompt > > The `process_experiences` method in #466 was added to do that normalization. Hi, I see...
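As a rough illustration of the per-prompt normalization being discussed (not the actual `process_experiences` implementation), assuming each prompt has a fixed-size group of sampled responses:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalize rewards within each prompt group to get GRPO-style advantages.

    rewards: (num_prompts, group_size), one row of sampled responses per prompt.
    Returns (reward - group mean) / group std, computed per row.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Hypothetical rewards: 2 prompts, 4 sampled responses each.
rewards = torch.tensor([[1.0, 0.0, 0.5, 0.5],
                        [2.0, 2.0, 1.0, 3.0]])
print(grpo_advantages(rewards))
```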