Transition-level reward design
In a fixed workflow with multiple roles (each defined by a distinct system prompt), AgentLightning models each role's I/O as transitions and may group them. Without role-level rewards, is training still effective, or does it degenerate into GRPO?
I think it's related to #31
Thank you for the reply. In the absence of role-level rewards and with a static workflow, can we regard the setup as equivalent to GRPO optimized with a trajectory-level reward?
I think it's slightly different. In GRPO for RLHF, each trajectory is a single generated response. In our setup, each trajectory consists of multiple responses.
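To make the difference concrete, here is a rough sketch, not AgentLightning's actual code. It assumes the trajectory-level reward is simply copied to every transition in that trajectory and that advantages are normalized within a group of rollouts, GRPO-style. The function names and the `(role, response)` tuple layout are hypothetical, chosen only for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_transitions(trajectories, rewards):
    """Copy each trajectory's advantage to all of its transitions.

    Assumption: no role-level reward exists, so every response in a
    trajectory shares the same training signal.
    """
    advantages = group_relative_advantages(rewards)
    samples = []
    for traj, adv in zip(trajectories, advantages):
        for role, response in traj:
            samples.append({"role": role, "response": response, "advantage": adv})
    return samples


# Two rollouts of the same fixed workflow, each with two roles.
trajectories = [
    [("planner", "plan A"), ("executor", "result A")],
    [("planner", "plan B"), ("executor", "result B")],
]
rewards = [1.0, 0.0]  # one trajectory-level reward per rollout

samples = broadcast_to_transitions(trajectories, rewards)
```

Under these assumptions, the group of two trajectories yields four training samples, and both responses in a trajectory carry the same advantage. In single-response GRPO the same group would yield only two samples, one per trajectory.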