
What are the differences between RAFT and InstructGPT's PPO?

Open Unrealluver opened this issue 1 year ago • 1 comment

Greetings! I am very interested in your work on RAFT. When I read the paper, it seems you still use an RL-style reward model to score the model's outputs, so the main difference between RAFT and InstructGPT's PPO is not very clear to me. Could you help provide more comparisons between them?

Unrealluver avatar Apr 23 '23 03:04 Unrealluver

Hi, thanks for your interest!

PPO is an on-policy, policy-based deep reinforcement learning (DRL) method, which achieves a high reward by formulating the task as an MDP. RAFT is a supervised-learning-based method, but it also iteratively uses samples produced by a generator (either the current model or some expert model).

In comparison, supervised learning converges faster and is more robust than DRL, and the design of RAFT also allows it to use samples beyond a pre-collected dataset, unlike traditional supervised fine-tuning. The main motivation here is that we are training a generative model, and a good reward function naturally serves as a criterion for picking high-quality samples; such downsampling may also be preferred for computational reasons. Moreover, RAFT basically alternates between inference and SFT, which are not coupled with each other, so it is easier to implement. Another advantage is that RAFT can take samples from diverse sources, whereas PPO, being an on-policy DRL method, can only use the samples generated by the policy itself.
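To make the alternation concrete, here is a minimal sketch of one RAFT-style round (generate, score with a reward function, keep the top-ranked samples, fine-tune on them). The `generate`, `reward_fn`, and `sft_update` helpers below are hypothetical placeholders for illustration only, not the actual LMFlow API.

```python
import random

# Hypothetical placeholders (not the LMFlow API) used only to illustrate the loop.
def generate(model, prompt, num_samples):
    """Sample several candidate responses from the current model."""
    return [f"{prompt} -> response {i} ({model})" for i in range(num_samples)]

def reward_fn(prompt, response):
    """Score a response; in practice this is a learned reward model."""
    return random.random()

def sft_update(model, dataset):
    """One round of supervised fine-tuning on the selected samples."""
    return model  # the updated model would be returned here

def raft_iteration(model, prompts, num_samples=8, keep_ratio=0.25):
    # 1) Inference: sample candidate responses for each prompt.
    candidates = [(p, r) for p in prompts for r in generate(model, p, num_samples)]
    # 2) Ranking: score every candidate with the reward function.
    scored = sorted(candidates, key=lambda pr: reward_fn(*pr), reverse=True)
    # 3) Filtering: keep only the highest-reward fraction of samples.
    selected = scored[: max(1, int(len(scored) * keep_ratio))]
    # 4) SFT: fine-tune on the selected (prompt, response) pairs.
    return sft_update(model, selected)

model = "policy-v0"
prompts = ["Explain RAFT in one sentence."]
for step in range(3):  # iterate: each round generates fresh samples
    model = raft_iteration(model, prompts)
```

Because step 1 can draw samples from any generator (the current model, an expert model, or an offline dataset), the filtering and SFT steps are decoupled from how the samples were produced, which is the flexibility mentioned above.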

RAFT definitely still has many limitations so far, and we are continuing to work on improving it. You are welcome to try it out and give suggestions!

WeiXiongUST avatar Apr 23 '23 12:04 WeiXiongUST

This issue has been marked as stale because it has not had recent activity. If you think this still needs to be addressed please feel free to reopen this issue. Thanks

shizhediao avatar May 15 '23 00:05 shizhediao