Shihan Dou comments

Results: 50 comments of Shihan Dou

Can I run this pipeline on A100-40GB?

> Thanks, and what is your RAM usage rate?

Memory usage is about 800~900GB with a batch size of 2*8 and 4k multi-turn queries in the training phase.

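Performance of PPO-max compared with the original PPO

> Does the norm+clip configuration only delay the onset of this problem (is its effect the same as reducing the lr)? With more training steps, the model still converges to max-length.

Yes. We tried up to 5000 steps. The KL penalty alleviates the problem to a large extent, and training may not collapse at all. The other tricks only mitigate it.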
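The interaction between reward norm+clip and the KL penalty discussed above can be sketched as follows. This is a minimal illustration of one common RLHF reward-shaping scheme, not the project's actual code: the function name `shape_rewards` and the parameters `kl_coef`, `clip`, `mean`, and `std` are hypothetical, and their default values are placeholders rather than settings from the experiments.

```python
def shape_rewards(score, logprobs, ref_logprobs,
                  kl_coef=0.1, clip=5.0, mean=0.0, std=1.0):
    """Build per-token rewards for one sampled response.

    score        -- scalar reward-model score for the whole response
    logprobs     -- log-probabilities of the sampled tokens under the policy
    ref_logprobs -- log-probabilities of the same tokens under the reference model
    All default values are illustrative assumptions.
    """
    if not logprobs or len(logprobs) != len(ref_logprobs):
        raise ValueError("logprobs and ref_logprobs must be non-empty and equal length")

    # Reward norm + clip: rescale the score, then bound its magnitude.
    normed = (score - mean) / std
    clipped = max(-clip, min(clip, normed))

    # KL penalty: every token is penalized by how far the policy has
    # drifted from the reference model on that token.
    rewards = [-kl_coef * (lp - ref) for lp, ref in zip(logprobs, ref_logprobs)]

    # The (normalized, clipped) score is added on the final token only.
    rewards[-1] += clipped
    return rewards


rewards = shape_rewards(10.0, [-1.0, -2.0], [-1.5, -2.0])
```

In this sketch, norm+clip only bounds the size of the score, while the KL term penalizes every token that drifts away from the reference model, which may be part of why it is the more effective safeguard against collapse.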

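Performance of PPO-max compared with the original PPO

> Does the random seed also have an effect? With the same hyperparameters, different random seeds converge differently. Also, why should the buffer size be as small as possible? The smaller the buffer size, the larger the variance, and it also forces a smaller batch size.

We use the same random seed for all experiments. Keeping the buffer size as small as possible means the training process is more "on policy".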
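One way to see why a smaller buffer is more "on policy": every sample in the buffer is generated before any gradient step of that round, so each later minibatch is consumed by a policy that has already moved. The sketch below counts that lag. The function name and the `epochs` parameter are hypothetical, and it assumes a plain "collect a buffer, then run minibatch updates" loop rather than the project's exact training loop.

```python
def policy_lag_per_minibatch(buffer_size, batch_size, epochs=1):
    """Return, for each minibatch update in one round, how many gradient
    steps the policy has already taken since the buffer was sampled."""
    if batch_size <= 0 or buffer_size % batch_size != 0:
        raise ValueError("buffer_size must be a positive multiple of batch_size")
    updates_per_round = (buffer_size // batch_size) * epochs
    # Update k is computed on data that is k gradient steps stale.
    return list(range(updates_per_round))


large = policy_lag_per_minibatch(buffer_size=8, batch_size=2)   # [0, 1, 2, 3]
small = policy_lag_per_minibatch(buffer_size=2, batch_size=2)   # [0]
```

With the buffer equal to one batch, every update uses data sampled from the current policy. Larger buffers trade that freshness for more updates per rollout.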

Clarification on MetaRM-optimization Implementation

Hi, thank you very much for your attention! The release of the MetaRM code has been delayed as we have been occupied with paper submissions. We will be publishing the...

Value model and reward model

Yes, the value model's initial weights are those of the reward model. Yes, the value model maps each token's hidden state (of size hidden size) to a scalar.
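The two points above (initialization from the reward model, and a per-token map from hidden size to a scalar) can be illustrated with a toy sketch. It uses plain Python lists instead of a real transformer; `ScalarHead` and `init_value_head_from_reward` are hypothetical names, not identifiers from the repository.

```python
import copy


class ScalarHead:
    """Toy linear head: maps a hidden vector of length hidden_size to one scalar."""

    def __init__(self, weight, bias=0.0):
        self.weight = list(weight)  # length == hidden_size
        self.bias = bias

    def __call__(self, hidden_states):
        # hidden_states: one vector per token; returns one scalar per token.
        return [
            sum(w * h for w, h in zip(self.weight, vec)) + self.bias
            for vec in hidden_states
        ]


def init_value_head_from_reward(reward_head):
    """Start the value head as an independent copy of the reward head,
    so later value updates do not modify the reward model."""
    return copy.deepcopy(reward_head)


reward_head = ScalarHead([1.0, 2.0])
value_head = init_value_head_from_reward(reward_head)
values = value_head([[1.0, 1.0], [0.0, 0.5]])  # one value per token
```

In a real setup the copy would cover the full model weights, not only the head. The deep copy matters because the value model is trained during PPO while the reward model is typically kept frozen.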

Why are you not releasing your reward model for English?

Hi, the reward model for English is available here: https://huggingface.co/Ablustrund/moss-rlhf-reward-model-7B-zh

PPO data en

Hi, thank you for your interest in this work! We have cleaned the dataset with some filtering algorithms, but for some confidentiality reasons, we can't release the dataset at the...

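Script for training the reward model

Thank you for your interest in this project. Training the reward model involves some methods for improving reward model performance, so we cannot open-source the reward model for now. We expect to open-source the reward model training after PART II is released in August-September. Thank you for your interest and support.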

Question about reward score computation in the PPO stage

@ruizheng20

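Is it feasible to retrain the RM with our own base model and our own SFT weights?

Hello.
1. Our code supports llama and llama2, but it is easy to extend to other decoder-only models such as bloomz and baichuan. You only need to modify the corresponding llama model and llama tokenizer under llama/.
2. We do not support the reward model for now, but we expect to release the second version of the technical report, which covers reward model training, at the end of the month.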