About the training details of step 3 in DeepSpeed-Chat: PPO
Regarding the two parts of the code, generating training data and PPO training (applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py), I think the current training is closer to an on-policy method: because per_device_train_batch_size == per_device_mini_train_batch_size, each batch of training data is generated (through experience rollouts) and immediately used for PPO training. But I remember the paper used an off-policy training method, that is, first generating a large amount of experience data, then performing PPO training on it, and repeating this over multiple cycles. Is my understanding wrong? Or is the on-policy method used now more effective?
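To make the comparison concrete, here is a minimal sketch of the two loop structures I mean. The function names `generate_experience` and `ppo_update` are placeholders for illustration only, not the actual DeepSpeed-Chat APIs:

```python
def generate_experience(prompts):
    # Placeholder: roll out the current actor on a batch of prompts and score
    # the responses with the reward model; returns a list of experience records.
    return [{"prompt": p} for p in prompts]

def ppo_update(batch):
    # Placeholder: one PPO optimization step on a mini-batch of experience.
    pass

def on_policy_loop(prompt_loader):
    # What step 3 appears to do now: each freshly generated batch is trained on
    # immediately (generation batch size == training batch size).
    for prompts in prompt_loader:
        experience = generate_experience(prompts)
        ppo_update(experience)

def off_policy_loop(prompt_loader, buffer_size=8, ppo_epochs=4, mini_batch_size=2):
    # What I understood from the paper: accumulate a large experience buffer
    # first, then run several PPO epochs over it before regenerating.
    buffer = []
    for prompts in prompt_loader:
        buffer.extend(generate_experience(prompts))
        if len(buffer) >= buffer_size:
            for _ in range(ppo_epochs):
                for i in range(0, len(buffer), mini_batch_size):
                    ppo_update(buffer[i:i + mini_batch_size])
            buffer.clear()  # experience is stale once the actor has been updated
```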
Hi, yes, you are right. For the offline case, we found it very easy to diverge. Please take a look at https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/README.md.
If you find a way to make multi-step training work, please do not hesitate to let us know :).
Thanks
Closed as no follow-up.