about credit assignment

Open Kenwwww opened this issue 2 months ago • 1 comment

Great job!! I have a question about the source code logic. Judging from a few examples, the agent seems to return a single reward directly, which is then used for training. Where is the credit assignment part implemented, that is, the part that decomposes a multi-turn trajectory into units and then constructs training samples from them?

Nov 06 '25 03:11 Kenwwww

You can use emit_reward to generate intermediate reward signals.

However, the current verl algorithm only supports identical credit assignment. To customize that part, please refer to #31.

Nov 10 '25 00:11 ultmaster
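
To make the terminology above concrete, here is a minimal sketch of what "identical credit assignment" could look like, assuming it means that every turn of a multi-turn trajectory receives the same final reward when training samples are built. This is not the project's actual code: `Turn`, `Sample`, `identical_credit_assignment`, and `per_turn_credit_assignment` are hypothetical names made up for illustration, and the second function is only one possible customization that uses intermediate rewards where they exist.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    """One turn of a multi-turn trajectory (hypothetical structure)."""
    prompt: str
    response: str
    reward: Optional[float] = None  # intermediate reward, if one was emitted


@dataclass
class Sample:
    """One training sample built from a single turn (hypothetical structure)."""
    prompt: str
    response: str
    reward: float


def identical_credit_assignment(turns: List[Turn], final_reward: float) -> List[Sample]:
    # Assumption: "identical" means every turn gets the same final reward,
    # regardless of any intermediate reward recorded on the turn.
    return [Sample(t.prompt, t.response, final_reward) for t in turns]


def per_turn_credit_assignment(turns: List[Turn], final_reward: float) -> List[Sample]:
    # One possible customization (illustrative only): prefer a turn's own
    # intermediate reward and fall back to the final reward when it is missing.
    return [
        Sample(t.prompt, t.response, final_reward if t.reward is None else t.reward)
        for t in turns
    ]


if __name__ == "__main__":
    trajectory = [Turn("q1", "a1"), Turn("q2", "a2", reward=0.5)]
    for sample in identical_credit_assignment(trajectory, final_reward=1.0):
        print(sample)
```

The two functions share a signature so that one can be swapped for the other when comparing strategies; how intermediate rewards are actually recorded and consumed depends on the library and the trainer.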