Shouldn't the output of the critic be generated for each individual time step, rather than averaged over the entire sequence?

Open zhangyipin opened this issue 1 year ago • 1 comments

https://github.com/hpcaitech/ColossalAI/blob/29386a54e66d7e5ca40cabf1686839fba9aac71d/applications/ChatGPT/chatgpt/models/base/critic.py#L46

zhangyipin avatar Mar 08 '23 03:03 zhangyipin

Hello @zhangyipin, thanks for your question!

This point is still debated, and the key question is how to define a step.

(1) If each inference (one token) is treated as one step, then one generate call is an episode: the per-step reward is the KL divergence term, and the reward at the end of the episode is the output of the reward model. The new state is (state + token).

(2) If each generate call (one full sequence) is treated as one step (and one episode), then the reward is produced by the KL divergence and the reward model, averaged over the sequence.
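
The two definitions above can be contrasted with a minimal sketch. This is plain Python and not ColossalAI's implementation: the function names, the `kl_coef` parameter, the sign convention (KL applied as a penalty), and the use of the per-token log-probability difference as the KL estimate are all assumptions made for illustration.

```python
from typing import List


def per_token_rewards(
    actor_logp: List[float],
    ref_logp: List[float],
    rm_score: float,
    kl_coef: float = 0.1,  # hypothetical penalty weight
) -> List[float]:
    """Definition (1): one token is one step.

    Every step receives a KL penalty; the reward model score is added
    only at the final step of the episode.
    """
    if not actor_logp or len(actor_logp) != len(ref_logp):
        raise ValueError("log-prob lists must be non-empty and equal length")
    # Per-token KL estimate: log pi_actor(token) - log pi_ref(token)
    kl = [a - r for a, r in zip(actor_logp, ref_logp)]
    rewards = [-kl_coef * k for k in kl]
    rewards[-1] += rm_score
    return rewards


def sequence_reward(
    actor_logp: List[float],
    ref_logp: List[float],
    rm_score: float,
    kl_coef: float = 0.1,  # hypothetical penalty weight
) -> float:
    """Definition (2): one generated sequence is one step.

    A single scalar reward combines the reward model score with the
    KL estimate averaged over the sequence.
    """
    if not actor_logp or len(actor_logp) != len(ref_logp):
        raise ValueError("log-prob lists must be non-empty and equal length")
    kl = [a - r for a, r in zip(actor_logp, ref_logp)]
    mean_kl = sum(kl) / len(kl)
    return rm_score - kl_coef * mean_kl
```

Under definition (1) the critic would need one value estimate per token to match the per-token rewards, while under definition (2) a single value per sequence is enough, which is the difference the original question is pointing at.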

Either way, this is an open discussion, and you are welcome to share your understanding and opinion!

Camille7777 avatar Mar 17 '23 10:03 Camille7777

This issue was closed due to inactivity. Thanks.

binmakeswell avatar Apr 27 '23 10:04 binmakeswell