agent-lightning
about credit assignment
Great job!! I have a question about the source code logic. Judging from a few of the examples, the agent directly returns a reward, which is then used for training. Where is the credit assignment part implemented, that is, the part where a multi-turn trajectory is decomposed into units and training samples are constructed from them?
You can use emit_reward to generate intermediate reward signals.
However, the current verl algorithm supports only identical credit assignment. To customize that part, please refer to #31.
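To make "identical credit assignment" concrete, here is a minimal sketch, assuming it means that every turn of a multi-turn trajectory is turned into its own training sample and all of them receive the same final reward. The names `Turn`, `Sample`, and `assign_identical_credit` are hypothetical and made up for illustration; this is not the actual agent-lightning or verl code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    """One LLM call inside a multi-turn trajectory (hypothetical type)."""
    prompt: str
    response: str


@dataclass
class Sample:
    """One training sample built from a single turn (hypothetical type)."""
    prompt: str
    response: str
    reward: float


def assign_identical_credit(turns: List[Turn], final_reward: float) -> List[Sample]:
    """Decompose a trajectory into per-turn samples.

    Assumption: "identical" means every turn gets the same final reward,
    regardless of its position in the trajectory.
    """
    return [Sample(t.prompt, t.response, final_reward) for t in turns]


# A two-turn trajectory that ended with a final reward of 1.0.
trajectory = [
    Turn(prompt="Plan the query", response="SELECT ..."),
    Turn(prompt="Fix the error", response="SELECT name FROM users"),
]
samples = assign_identical_credit(trajectory, final_reward=1.0)
```

A custom scheme (for example, discounting earlier turns or using intermediate rewards) would replace the body of the hypothetical `assign_identical_credit` with a per-turn reward computation.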