liushixuan
> OK, I went back over this old project and can answer the question. > > Searching for the keyword `self.init_tilda_op` in uf/task/base.py shows that, within the whole training flow, this op actually runs only once as initialization at the start of training, not at the beginning of every epoch. The comment "runs at the start of each epoch" was my mistake; I will correct it to "runs at the start of training" in the next release. > > Thanks for catching this~ Thanks for the explanation~
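(For readers following along: a minimal sketch of the behaviour described above, assuming a TensorFlow 1.x style session loop; the `init_tilda_op` and `train_op` below are placeholders, not the actual ops built in uf/task/base.py.)

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Hypothetical placeholder ops standing in for the ones built in uf/task/base.py.
init_tilda_op = tf.no_op(name="init_tilda")
train_op = tf.no_op(name="train")

num_epochs = 3
num_batches = 10

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Runs exactly once, as part of training initialization ...
    sess.run(init_tilda_op)
    for epoch in range(num_epochs):
        # ... and is NOT re-run at the start of each epoch.
        for _ in range(num_batches):
            sess.run(train_op)
```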
Hi, do you have any plans to support pipeline parallelism now?
> Compute the response length; if the implementation doesn't match that expectation, it's a bug. Thanks for the explanation! One more question: when computing rrhf_loss and ft_loss over the response, should both the query and the padding be masked out, or only the query part?
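For reference, a minimal sketch of what masking out both the query and the padding looks like when averaging a per-token loss; the function and tensor names here are hypothetical and not taken from the repository.

```python
import torch

def masked_mean(per_token_loss: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """Average a per-token loss over response tokens only.

    per_token_loss: (batch, seq_len) loss at every position.
    response_mask:  (batch, seq_len) 1 for response tokens, 0 for query AND padding.
    """
    return (per_token_loss * response_mask).sum() / response_mask.sum().clamp(min=1)

# Hypothetical example: sequence = [query, query, response, response, pad]
per_token_loss = torch.tensor([[0.5, 0.7, 1.0, 2.0, 3.0]])
response_mask = torch.tensor([[0.0, 0.0, 1.0, 1.0, 0.0]])
print(masked_mean(per_token_loss, response_mask))  # averages only the two response tokens
```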
> > 1. Did you do experiments on this to see which one performs better? > > I have already tried specifying average_log_prob=True, but the beta value needs some adjustment....
> It seems that most of the code is copied from `train_ppo.py`. The difference between GRPO and PPO is just removing the value net and calculating the advantage only using...
> Hi, I want to confirm the implementation of `compute_approx_kl` in `openrlhf.models.utils`. The GRPO paper claims they use the unbiased estimator `pi_ref/pi - log(pi_ref/pi) - 1` from http://joschu.net/blog/kl-approx.html. I see the blog...
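For context, here is a small self-contained sketch of the k3 estimator from that blog post, written against hypothetical per-token log-probs rather than the actual signature of `compute_approx_kl`.

```python
import torch

def approx_kl_k3(log_probs: torch.Tensor, ref_log_probs: torch.Tensor) -> torch.Tensor:
    """Per-token k3 estimator of KL(pi || pi_ref), per http://joschu.net/blog/kl-approx.html.

    With r = pi_ref / pi evaluated on samples drawn from pi:
        k3 = r - log(r) - 1
    which is unbiased (E[r - 1] = 0) and always non-negative.
    """
    log_ratio = ref_log_probs - log_probs          # log(pi_ref / pi)
    return torch.exp(log_ratio) - log_ratio - 1.0  # pi_ref/pi - log(pi_ref/pi) - 1

# Hypothetical per-token log-probs under the policy and the reference model.
log_probs = torch.tensor([-1.2, -0.8, -2.0])
ref_log_probs = torch.tensor([-1.0, -1.1, -1.9])
print(approx_kl_k3(log_probs, ref_log_probs))
```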
Hi, I updated the code to reduce its redundancy. The main ideas of this update are as follows:
* update ```train_grpo_ray.py``` to support ray and vllm for GRPO.
* update ```Experience```...
> @LSX-Sneakerprogrammer Hi, after some offline discussion, it seems that PPO and GRPO can share a lot of code, and we have been working on preparing for such a merge recently, by...
> We'll try to merge #466 this week, so both will be fine. > > BTW, it seems to me that we no longer need a separate `GRPOTrainer` or `ActorGRPOTrainer`...
> > computing GRPO advantages requires normalizing the rewards across responses sampled from the same prompt > > The `process_experiences` method in #466 was added to do that normalization. Hi, I see...
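As a rough illustration of the per-prompt normalization being discussed (not the actual `process_experiences` implementation), assuming each prompt has a fixed-size group of sampled responses:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalize rewards within each prompt group to get GRPO-style advantages.

    rewards: (num_prompts, group_size), one row of sampled responses per prompt.
    Returns (reward - group mean) / group std, computed per row.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Hypothetical rewards: 2 prompts, 4 sampled responses each.
rewards = torch.tensor([[1.0, 0.0, 0.5, 0.5],
                        [2.0, 2.0, 1.0, 3.0]])
print(grpo_advantages(rewards))
```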