
About the training details of step 3 in DeepSpeed-Chat: PPO


Regarding the two parts of the code (applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py), generating training data and PPO training, I think the current training is closer to an on-policy method. Because per_device_train_batch_size == per_device_mini_train_batch_size, the training data is generated (through experience collection) and used for PPO training at the same time. But I remember the paper used an off-policy training method: first generate a large amount of data through experience collection, then perform PPO training over it, and repeat this for multiple cycles. Is my understanding wrong? Or is the on-policy method used now more effective? A rough sketch of the schedule I mean follows.
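To make the schedule concrete, here is a minimal Python sketch of what I am describing. It is not the actual DeepSpeed-Chat code; generate_experience and ppo_update are hypothetical stand-ins for the rollout and optimisation steps in main.py:

```python
def generate_experience(actor, prompts):
    """Hypothetical stand-in for the rollout step: sample responses and score them."""
    return [{"prompt": p, "response": "...", "old_logprob": 0.0, "reward": 0.0}
            for p in prompts]

def ppo_update(actor, critic, batch):
    """Hypothetical stand-in for one PPO optimisation step over a mini-batch."""
    pass

def train_on_policy(actor, critic, prompt_loader):
    # Every generated batch is trained on immediately and then discarded,
    # because the generation batch size equals the PPO mini-batch size.
    for prompts in prompt_loader:
        batch = generate_experience(actor, prompts)  # experience generation
        ppo_update(actor, critic, batch)             # one PPO pass over that same batch
```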

guijuzhejiang avatar Apr 22 '23 06:04 guijuzhejiang

Hi, yes, you are right. For the offline case, we found it very easy to diverge. Please take a look at https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/README.md.

If you find a way to make multi-step training work, please do not hesitate to let us know :). For reference, that schedule would look roughly like the sketch below.
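This is only a sketch, reusing the same hypothetical stand-ins as the sketch above; buffer_size and ppo_epochs are illustrative knobs, not necessarily the arguments exposed by main.py:

```python
def generate_experience(actor, prompts):
    """Same hypothetical rollout stand-in as in the sketch above."""
    return [{"prompt": p, "old_logprob": 0.0, "reward": 0.0} for p in prompts]

def ppo_update(actor, critic, batch):
    """Same hypothetical PPO-step stand-in as in the sketch above."""
    pass

def train_off_policy(actor, critic, prompt_loader,
                     buffer_size=512, ppo_epochs=4, mini_batch=8):
    # Collect a large buffer of experience first, then run several PPO
    # epochs of mini-batch updates over that same (increasingly stale) data
    # before regenerating. This is the variant we found easy to diverge.
    buffer = []
    for prompts in prompt_loader:
        buffer.extend(generate_experience(actor, prompts))
        if len(buffer) < buffer_size:
            continue
        for _ in range(ppo_epochs):                    # reuse the same experience
            for i in range(0, len(buffer), mini_batch):
                ppo_update(actor, critic, buffer[i:i + mini_batch])
        buffer.clear()                                 # then generate fresh data
```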

Thanks

yaozhewei avatar Apr 24 '23 03:04 yaozhewei

Closed as no follow-up

yaozhewei avatar May 05 '23 18:05 yaozhewei