Transition-level reward design

Open boardman0 opened this issue 3 months ago • 3 comments

In a fixed workflow with multiple roles (each defined by a distinct system prompt), AgentLightning models each role's I/O as transitions and may group them. Without role-level rewards, is training still effective, or does it degenerate into GRPO?

Sep 26 '25 09:09 boardman0

I think it's related to #31

Sep 26 '25 16:09 ultmaster

Thank you for the reply. In the absence of role-level rewards and with a static workflow, can we regard the setup as equivalent to GRPO optimized with a trajectory-level reward?

Sep 26 '25 17:09 boardman0

I think it's slightly different. In GRPO for RLHF, each trajectory is a single generated response. In our setup, each trajectory consists of multiple responses.

Nov 29 '25 14:11 ultmaster
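
To make the distinction in the last reply concrete, here is a minimal sketch of what a trajectory-level reward could look like when each trajectory holds several responses. It is not AgentLightning's actual implementation: the `Transition` class and both function names are hypothetical, and it assumes the simplest possible credit assignment, where every transition in a trajectory inherits that trajectory's group-normalized advantage.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Transition:
    """Hypothetical container for one role's I/O within a workflow run."""
    role: str
    prompt: str
    response: str


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization: center and scale rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_transitions(trajectories, rewards):
    """Assumed credit assignment: copy each trajectory's advantage to all
    of its transitions, producing one training sample per response."""
    advantages = group_relative_advantages(rewards)
    samples = []
    for trajectory, advantage in zip(trajectories, advantages):
        for transition in trajectory:
            samples.append((transition, advantage))
    return samples
```

With single-response trajectories this reduces to the usual GRPO setup, with one sample per rollout. With a multi-role workflow, one rollout yields several samples that share an advantage, so no role is credited separately from the others.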