Shihan Dou

Results: 50 comments of Shihan Dou

@panxb833 @LuciusMos I ran into the same problem as you. Do you know how to solve it?

Thank you for your great support! Because reward model training involves more methods, this part will be explained in the second part of the technical report. Thank you...

> Probably in August or September of this year, thanks for your interest.

Hello, our README has not been updated yet; it will be updated after the model is released (in about 1 day).

We have already published a sample of the data structure and format.

Thanks for your attention. We have no plans to open-source the **Chinese** dataset in the near future; we will update the list if we do decide to open-source it.

In our experiments, GPU memory usage was about 50 GB with ZeRO-2 and no offloading. If you have larger CPU memory, you may be able to use an A100-40GB to train your...
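
A minimal sketch of a DeepSpeed configuration that enables ZeRO-2 with optimizer offloading to CPU, which is one way to fit training on an A100-40GB when host RAM is large. The batch-size and precision settings here are illustrative assumptions, not the values used in our experiments (the ~50 GB figure above was measured with ZeRO-2 and *no* offload).

```python
# Hypothetical DeepSpeed config: ZeRO-2 + CPU optimizer offload.
# All numeric values are placeholders, not the settings used in our runs.
deepspeed_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        # Offloading optimizer states to CPU trades host memory and PCIe
        # bandwidth for GPU memory headroom.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
```
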

PPO-max helps researchers find better ways to stabilize the RLHF training process. The difference between PPO-max and vanilla PPO lies more in whether RLHF training can be stabilized at all, rather than in a difference in final performance.
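
For illustration, here is a minimal, generic sketch of the clipped PPO surrogate loss with advantage whitening, one of the common stabilization tricks in this setting. This is an assumption-level example of the kind of technique involved, not the actual PPO-max implementation.

```python
# Generic clipped PPO policy loss with advantage whitening (illustrative only).
import numpy as np

def whiten(advantages: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize advantages to zero mean / unit variance for stability."""
    return (advantages - advantages.mean()) / (advantages.std() + eps)

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps: float = 0.2) -> float:
    """Standard clipped surrogate objective (returned as a loss to minimize)."""
    adv = whiten(np.asarray(advantages, dtype=np.float64))
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return float(-np.minimum(unclipped, clipped).mean())

# Example with a toy token-level batch of log-probs and advantages.
loss = ppo_clip_loss(
    logp_new=[-1.2, -0.8, -2.0],
    logp_old=[-1.0, -1.0, -1.9],
    advantages=[0.5, -0.3, 1.2],
)
print(f"clipped PPO loss: {loss:.4f}")
```
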

We found that models whose training has failed tend to repeat a single sentence until reaching the maximum generation length.