ColossalAI
Shouldn't the output of the critic be generated for each individual time step, rather than averaged over the entire sequence?
https://github.com/hpcaitech/ColossalAI/blob/29386a54e66d7e5ca40cabf1686839fba9aac71d/applications/ChatGPT/chatgpt/models/base/critic.py#L46
Hello @zhangyipin, thanks for your question!
This point is still debated, and the key question is how a "step" is defined.
(1) If each inference (one token) is treated as one step, then a full generate call is an episode. The per-step reward is the KL divergence term, and the episode-end reward is the output of the reward model. The new state is (state + token).
(2) If each generate call (one whole sequence) is treated as a single step (and a single episode), then the reward is produced from the KL divergence and the reward model output in an averaged manner.
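To make the difference concrete, here is a minimal sketch of the two reward definitions. It is not the ColossalAI implementation: the function names, the `kl_coef` parameter, and the use of `log pi(token) - log pi_ref(token)` as a per-token KL estimate are assumptions for illustration, and the KL term is assumed to enter as a penalty, as in common RLHF setups.

```python
from typing import List


def _per_token_kl(logp_actor: List[float], logp_ref: List[float]) -> List[float]:
    # Assumed per-token KL estimate: log-prob under the actor minus
    # log-prob under the reference model, for each generated token.
    if not logp_actor or len(logp_actor) != len(logp_ref):
        raise ValueError("log-prob lists must be non-empty and of equal length")
    return [a - r for a, r in zip(logp_actor, logp_ref)]


def per_token_rewards(logp_actor: List[float], logp_ref: List[float],
                      rm_score: float, kl_coef: float = 0.1) -> List[float]:
    """Option (1): one token = one step.

    Every step gets a KL penalty; the reward model score is added
    only at the last step (end of the episode).
    """
    rewards = [-kl_coef * k for k in _per_token_kl(logp_actor, logp_ref)]
    rewards[-1] += rm_score
    return rewards


def sequence_reward(logp_actor: List[float], logp_ref: List[float],
                    rm_score: float, kl_coef: float = 0.1) -> float:
    """Option (2): one whole sequence = one step.

    A single scalar reward: the reward model score minus the KL
    penalty averaged over the sequence.
    """
    kl = _per_token_kl(logp_actor, logp_ref)
    return rm_score - kl_coef * sum(kl) / len(kl)
```

Under option (1) a critic would naturally output one value per token, since each token has its own reward and return. Under option (2) there is only one reward per sequence, so a single sequence-level value is enough.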
Either way, this is an open discussion, and you are welcome to share your understanding and opinions!
This issue was closed due to inactivity. Thanks.