ColossalAI
[chatgpt]: the policy loss and value loss of PPO may be incorrect
In `training_step` of `PPOTrainer`, the advantage and the KL penalty are computed by the old Actor during experience making, while the value loss is computed with clamping. I think both are incorrect, for the following reasons:
- The KL penalty should be recomputed by the current Actor.
- The Critic is independent of the action: whether the experience is collected by the old Actor or the current Actor, the distribution of states (prompts) is the same, so the value loss should be the same as in A2C, without PPO clipping.
Below is my conclusion about the policy loss and the value loss (the original post highlighted the formulas in red and green): the red parts should be collected by the old Actor during experience making, and the green parts should be computed by the current Actor and Critic.
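The highlighted equation screenshots from the original post are not preserved here; as a substitute, here is a sketch of the standard PPO objectives with the coloring replaced by annotations, assuming the usual PPO notation rather than the exact formulas from the lost images:

```latex
% Policy loss: \hat{A}_t and \pi_{\theta_\text{old}} ("red") are collected by the old Actor
% during experience making; \pi_\theta ("green") is evaluated by the current Actor.
L^{\mathrm{CLIP}}(\theta) =
  -\,\mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\;
      \operatorname{clip}\!\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}

% Value loss: the return R_t ("red") is collected during experience making;
% V_\phi(s_t) ("green") is evaluated by the current Critic, without clipping.
L^{V}(\phi) = \mathbb{E}_t\!\left[\big(V_\phi(s_t) - R_t\big)^2\right]
```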
Please consider the points above. I will submit a PR so you can compare the difference in code.
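As a rough illustration of the proposed change (not the actual ColossalAI code; `actor.log_probs`, the `experience` dict keys, and the other names are assumptions), a minimal sketch where the old log-probs, advantages, and returns come from experience making, the KL penalty against the frozen base model is recomputed with the current Actor, and the value loss is a plain MSE on the prompt:

```python
import torch
import torch.nn.functional as F

def training_step_sketch(actor, critic, base_model, experience, clip_eps=0.2, kl_coef=0.1):
    """Illustrative PPO step following the proposal in this issue."""
    # Collected by the old Actor / experience maker (hypothetical keys).
    old_log_probs = experience["old_action_log_probs"]   # log pi_old(a|s)
    advantages = experience["advantages"]
    returns = experience["returns"]

    # Recomputed with the current models (assumed log_probs interface).
    log_probs = actor.log_probs(experience["sequences"], experience["action_mask"])
    with torch.no_grad():
        base_log_probs = base_model.log_probs(experience["sequences"], experience["action_mask"])

    # Clipped PPO policy loss.
    ratio = torch.exp(log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()

    # Approximate KL penalty between the *current* Actor and the frozen base model.
    kl_penalty = (log_probs - base_log_probs).mean()

    # Value loss on the prompt only, without clipping (A2C-style).
    values = critic(experience["prompts"])
    value_loss = F.mse_loss(values, returns)

    return policy_loss + kl_coef * kl_penalty, value_loss
```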
More details: https://openreview.net/forum?id=chDrutUTs0K
Thanks for your feedback.
1. The change in the value_loss computation is reasonable, because the Critic's input changes from the full sentence to the prompt.
2. I have thought about your last PR again, and I think we can use an action-value function here (which OpenAI only mentioned as a value function). Because the RM's input is the full sequence, it is more reasonable to use (state + action) as the input. We will add a bool flag to offer both choices.
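A rough sketch of that bool-flag idea; the flag name, forward signature, and the HuggingFace-style backbone output are illustrative assumptions, not the actual ColossalAI interface:

```python
import torch
import torch.nn as nn

class CriticSketch(nn.Module):
    """Illustrative Critic: use_action_input=True scores (prompt + action) like the RM,
    False scores the prompt alone, i.e. a classic V(s)."""
    def __init__(self, backbone: nn.Module, hidden_size: int, use_action_input: bool = True):
        super().__init__()
        self.backbone = backbone              # shared transformer body, same structure as the RM
        self.value_head = nn.Linear(hidden_size, 1)
        self.use_action_input = use_action_input

    def forward(self, prompt_ids: torch.Tensor, action_ids: torch.Tensor) -> torch.Tensor:
        if self.use_action_input:
            input_ids = torch.cat([prompt_ids, action_ids], dim=1)  # state + action
        else:
            input_ids = prompt_ids                                  # state only
        hidden = self.backbone(input_ids).last_hidden_state         # assumes HF-style output
        return self.value_head(hidden[:, -1]).squeeze(-1)           # value from the last token
```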
As for this issue, what you want to do is move the calculation of the advantage from the experience maker to the PPO trainer; I have to think more about the details here. Note that we compute the KL divergence between action_log and base_log, and the base_model is never changed, so the whole algorithm differs from classic PPO in many ways; we have to consider it more carefully.
Here are my two points:
- The advantage can be calculated in the experience maker, but the KL penalty between the Actor and the base model should be calculated in the training step using the current Actor, instead of the old Actor.
- The Critic is independent of the action, so the value loss should not care about changes in the action distribution (see the sketch below). Although the Critic and the RM share the same model structure, their inputs are not the same: the Critic takes the prompt as input, while the RM takes (prompt, action) or only the action as input. I think both options are fine, because the action probabilities already carry information about the prompt, but (prompt, action) is better.
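For the second point, a small sketch contrasting the clipped value loss used in PPO-style trainers with the plain MSE argued for here; tensor names are placeholders:

```python
import torch
import torch.nn.functional as F

def clipped_value_loss(values, old_values, returns, clip_eps=0.2):
    # PPO-style value clipping: penalizes the Critic for moving too far from the
    # values recorded when the experience was collected by the old models.
    values_clipped = old_values + torch.clamp(values - old_values, -clip_eps, clip_eps)
    return torch.max((values - returns) ** 2, (values_clipped - returns) ** 2).mean()

def plain_value_loss(values, returns):
    # The proposal here: since the Critic only depends on the prompt (state), whose
    # distribution does not change between the old and current Actor, an unclipped
    # MSE, as in A2C, is sufficient.
    return F.mse_loss(values, returns)
```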