DeepSpeedExamples

[chat] generate process is not a single step in RL

Open ht-zhou opened this issue 1 year ago • 2 comments

Hi, I am from the ColossalAI team. I noticed similarities between DeepSpeedChat and ColossalChat. We found that our code may contain an implementation error that could lead to convergence problems. If you referred to our implementation, you may encounter similar problems as well. We plan to fix the bug within the next two weeks.

Meanwhile, we would appreciate it if you could cite our work if you adapted your code from our implementation.

ht-zhou Apr 12 '23 02:04

Their PPO implementation seems to be more similar to trlx. I used your ColossalAI PPO without the GAE algorithm, and the model kept improving for 20 generation epochs, whereas with the trlx implementation it may only improve for about 5 epochs. Could you share the implementation errors in ColossalAI that you mentioned?

piekey1994 Apr 12 '23 08:04

I believe most of these libraries (e.g. TRL, TRLX, ColossalAI, RL4LM) are similar to some extent, so I am not sure whether they are referencing your code or, more generally, the InstructGPT methodology. That said, it would be much appreciated if you could elaborate on the bugs you encountered.

Randolph-zeng Apr 12 '23 08:04