Shihan Dou

Results: 50 comments of Shihan Dou

@panxb833 @LuciusMos I ran into the same problem as you. Do you know how to solve it?

Thank you for your great support! Because reward model training involves more methods, this part will be explained in the second part of the technical report. Thank you...

> Probably in August or September of this year, thanks for your interest.

Hello, our README has not been updated yet; it will be updated after the model is released (in about 1 day).

We have already published a sample of the data structure and format.

Thanks for your attention. We have no plans to open-source the **Chinese** dataset in the near future; we will update the list if we do decide to open-source it.

In our experiments, GPU memory usage was about 50 GB with ZeRO-2 and no offloading. If you have larger CPU memory, you may be able to use an A100-40GB to train your...
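
A minimal sketch of a DeepSpeed configuration that enables ZeRO-2 with optimizer offloading to CPU, which is one way to fit training on an A100-40GB when host RAM is large. The batch-size and precision settings here are illustrative assumptions, not the values used in our experiments (the ~50 GB figure above was measured with ZeRO-2 and *no* offload).

```python
# Hypothetical DeepSpeed config: ZeRO-2 + CPU optimizer offload.
# All numeric values are placeholders, not the settings used in our runs.
deepspeed_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        # Offloading optimizer states to CPU trades host memory and PCIe
        # bandwidth for GPU memory headroom.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
```
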

PPO-max helps researchers find better ways to stabilize the RLHF training process. The difference between PPO-max and vanilla PPO lies more in whether RLHF training can be stabilized at all, rather than in a difference in final performance.
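
For illustration, here is a minimal, generic sketch of the clipped PPO surrogate loss with advantage whitening, one of the common stabilization tricks in this setting. This is an assumption-level example of the kind of technique involved, not the actual PPO-max implementation.

```python
# Generic clipped PPO policy loss with advantage whitening (illustrative only).
import numpy as np

def whiten(advantages: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize advantages to zero mean / unit variance for stability."""
    return (advantages - advantages.mean()) / (advantages.std() + eps)

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps: float = 0.2) -> float:
    """Standard clipped surrogate objective (returned as a loss to minimize)."""
    adv = whiten(np.asarray(advantages, dtype=np.float64))
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return float(-np.minimum(unclipped, clipped).mean())

# Example with a toy token-level batch of log-probs and advantages.
loss = ppo_clip_loss(
    logp_new=[-1.2, -0.8, -2.0],
    logp_old=[-1.0, -1.0, -1.9],
    advantages=[0.5, -0.3, 1.2],
)
print(f"clipped PPO loss: {loss:.4f}")
```
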

We found that models whose training has failed tend to repeat a single sentence until reaching the maximum generation length.