
[chatgpt]: the policy loss and value loss of PPO may be incorrect

Open · wenjunyang opened this issue 1 year ago • 7 comments

In `training_step` of `PPOTrainer`, the advantage and the penalty are computed by the old Actor while making experience, and the value loss is computed with a clamp. I think these are incorrect, for the following reasons:

  1. The penalty should be recomputed by the current Actor.
  2. The Critic is independent of the action, and no matter whether the experience is collected by the old Actor or the current Actor, the distribution of states (prompts) is the same, so the value loss should be the same as in A2C, without the PPO-style clipping (see the sketch after this list).
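
Here is a minimal sketch of what I mean. The names (`experience.sequences`, `experience.action_mask`, the `actor(...)`/`critic(...)` call signatures) are illustrative, not the real ColossalAI API:

```python
import torch

def training_step_sketch(actor, critic, experience, kl_coef=0.1, clip_eps=0.2):
    # Quantities stored by the experience maker (old Actor / frozen base model).
    old_action_log_probs = experience.action_log_probs        # log pi_old(a|s)
    base_action_log_probs = experience.base_action_log_probs  # log pi_base(a|s)
    advantages = experience.advantages
    returns = experience.returns

    # Recompute log-probs with the *current* Actor.
    action_log_probs = actor(experience.sequences, experience.action_mask)

    # Point 1: the KL penalty against the frozen base model is recomputed here
    # with the current Actor, instead of reusing the old Actor's value.
    kl = action_log_probs - base_action_log_probs

    # Standard clipped PPO surrogate on the penalized advantage
    # (all tensors are assumed to have matching shapes).
    ratio = (action_log_probs - old_action_log_probs).exp()
    adv = advantages - kl_coef * kl
    surr1 = ratio * adv
    surr2 = ratio.clamp(1 - clip_eps, 1 + clip_eps) * adv
    policy_loss = -torch.min(surr1, surr2).mean()

    # Point 2: plain A2C-style value loss, with no clipping of the value target.
    values = critic(experience.sequences)
    value_loss = (values - returns).pow(2).mean()
    return policy_loss, value_loss
```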

Below is my conclusion about the policy loss and the value loss: [image]

The red part below should be collected by the old Actor while making experience: [image]

The green part below should be computed by the current Actor and Critic: [image]
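
Since the images may not survive copying, this is how I read the two losses in standard notation: $\hat{A}_t$, $\hat{R}_t$ and $\log \pi_{\theta_{\mathrm{old}}}$ are the red part (stored while making experience), while $\pi_\theta$, $V_\phi$ and the KL term are the green part (computed in the training step):

```latex
\begin{aligned}
r_t(\theta) &= \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} \\
L^{\mathrm{policy}}(\theta) &= -\,\mathbb{E}_t\!\left[\min\!\bigl(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\bigr)\right]
  + \beta\,\mathbb{E}_t\!\left[\mathrm{KL}\!\bigl(\pi_\theta(\cdot \mid s_t)\,\big\|\,\pi_{\mathrm{base}}(\cdot \mid s_t)\bigr)\right] \\
L^{\mathrm{value}}(\phi) &= \mathbb{E}_t\!\left[\bigl(V_\phi(s_t) - \hat{R}_t\bigr)^2\right]
\end{aligned}
```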

Please consider the questions above. I will submit a PR so you can compare the difference in code.

wenjunyang · Mar 09 '23 15:03

More details: https://openreview.net/forum?id=chDrutUTs0K [images]

wenjunyang · Mar 10 '23 00:03


Thanks for your feedback. 1) The change in the value_loss computation is reasonable because the Critic's input changes from the sentence to the prompt. 2) I have thought about your last PR again, and I think we can use an action-value function here (which OpenAI only mentioned as a value function).

ht-zhou · Mar 10 '23 02:03

Because the RM's input is a sequence, it is more reasonable here to use (state + action) as the input. We will add a bool flag to offer both choices.
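
A rough sketch of how such a flag could look (the class and argument names are illustrative, not the actual ColossalAI API):

```python
import torch
import torch.nn as nn

class CriticWrapper(nn.Module):
    """Wraps a reward-model-style backbone and selects its input.

    use_action_input=True  -> value of the whole (prompt + action) sequence
    use_action_input=False -> value of the prompt (state) only
    """
    def __init__(self, backbone: nn.Module, use_action_input: bool = True):
        super().__init__()
        self.backbone = backbone
        self.use_action_input = use_action_input

    def forward(self, prompt_ids: torch.Tensor, sequence_ids: torch.Tensor) -> torch.Tensor:
        # sequence_ids = prompt tokens followed by generated action tokens
        inputs = sequence_ids if self.use_action_input else prompt_ids
        return self.backbone(inputs)  # assumed to return one scalar value per sample
```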

ht-zhou · Mar 10 '23 02:03

As for this problem, what you want to do is move the calculation of the advantage from the ExperienceMaker to the PPOTrainer; I have to think more about the details here. Note that we compute the KL divergence between action_log_probs and base_log_probs, and the base model is never updated, so the whole algorithm differs from classic PPO in many ways; we have to consider it carefully.
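
For context, a simplified sketch of the reward shaping being discussed, i.e. subtracting an approximate KL between the Actor and the frozen base model from the RM score while making experience (not the exact ColossalAI code):

```python
import torch

@torch.no_grad()
def shaped_reward(reward, action_log_probs, base_action_log_probs,
                  kl_coef=0.1, action_mask=None):
    """Subtract a per-sample approximate KL penalty (Actor vs. frozen base model)
    from the scalar reward-model score, as done while making experience."""
    # log pi_actor(a|s) - log pi_base(a|s), summed over generated tokens
    log_ratio = action_log_probs - base_action_log_probs
    if action_mask is not None:
        log_ratio = log_ratio * action_mask
    approx_kl = log_ratio.sum(dim=-1)
    return reward - kl_coef * approx_kl
```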

ht-zhou · Mar 10 '23 02:03

Here are my two points:

  1. The advantage can be calculated in the ExperienceMaker, but the KL penalty between the Actor and the base model should be calculated in the training step using the current Actor, not the old Actor.
  2. The Critic is independent of the action, so the value loss should not care about changes in the action distribution. Although the Critic and the RM share the same model structure, their inputs are not the same: the Critic takes the prompt as input, while the RM takes (prompt, action) or only the action. I think both are fine, because the action probabilities already carry information about the prompt, but (prompt, action) is better. A side-by-side sketch of the two value-loss variants follows this list.
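
To make point 2 concrete, here are the two value-loss variants side by side (`values_old` stands for the Critic output stored at experience time; names are illustrative):

```python
import torch

def value_loss_unclipped(values, returns):
    # A2C-style value loss: the Critic does not depend on the action
    # distribution, so no PPO-style clipping is needed.
    return (values - returns).pow(2).mean()

def value_loss_clipped(values, values_old, returns, clip_eps=0.2):
    # PPO-style clipped value loss (what the current code does),
    # kept here only for comparison.
    values_clipped = values_old + (values - values_old).clamp(-clip_eps, clip_eps)
    surr1 = (values - returns).pow(2)
    surr2 = (values_clipped - returns).pow(2)
    return torch.max(surr1, surr2).mean()
```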

wenjunyang · Mar 10 '23 02:03