multi-agent system
I have read your paper and code; the work is beautiful and practical.
My question is as follows:
Taking the SQL Agent as an example: in a multi-agent system, if the final outcome reward is assigned directly as each sub-agent's reward, without designing a separate reward for each sub-agent, then the framework's transition classification never comes into play, and the training is equivalent to skipping classification and running GRPO directly at the trajectory level. Is this understanding correct?
Hi @boardman0
Thanks for your interest in Agent Lightning! I am not sure what you mean by transition classification; we do not classify transitions. Our algorithm is a minimal design: for one task, we collect all trajectories, break each trajectory into transitions, and group all transitions of that task together in GRPO. Each transition is (llm input, llm output, reward). We can support multi-agent setups (implemented with different instructions) because we place no constraint on the LLM input; it is just the LLM's input tokens, so the input can also contain the instructions for different agents.
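To make the grouping concrete, here is a minimal sketch (not Agent Lightning's actual API; `Transition`, `rollout_fn`, and `collect_grpo_group` are hypothetical names) of how all transitions from several rollouts of one task end up in a single GRPO group, regardless of which agent produced them:

```python
from dataclasses import dataclass

@dataclass
class Transition:
    llm_input: str   # full prompt, including the agent-specific instruction
    llm_output: str  # tokens generated by the LLM for this call
    reward: float    # reward credited to this transition

def collect_grpo_group(task, rollout_fn, num_rollouts=8):
    """Run several rollouts of one task and flatten them into one group.

    `rollout_fn(task)` is a hypothetical function that executes the
    (possibly multi-agent) workflow once and returns a list of Transitions.
    """
    group = []
    for _ in range(num_rollouts):
        trajectory = rollout_fn(task)  # one trajectory = many LLM calls
        group.extend(trajectory)       # break into per-call transitions
    return group                       # all transitions share one GRPO group

def grpo_advantages(group):
    """Group-relative advantage: normalize each reward against the group."""
    rewards = [t.reward for t in group]
    mean_r = sum(rewards) / len(rewards)
    std_r = (sum((r - mean_r) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean_r) / (std_r + 1e-8) for r in rewards]
```

Because the group is built from transitions rather than whole trajectories, transitions from different agents (different instructions in the LLM input) sit side by side in the same group without any special handling.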
Besides, it is not obvious how to apply GRPO at the trajectory level in multi-agent scenarios. Previous methods may require concatenating all transitions of one trajectory into a single sample; how would that concatenation be done when the transitions come from different agents?
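To illustrate why that concatenation is ambiguous, consider a toy two-agent trajectory (agent names and contents are made up for illustration): each call carries a different instruction, so there is no single shared prompt under which the outputs could be stitched into one contiguous training sample.

```python
# Illustrative only: one trajectory from a hypothetical two-agent SQL workflow.
trajectory = [
    {"llm_input": "[planner instruction] User question: ...",
     "llm_output": "Plan: first inspect the schema, then write the query.",
     "reward": 1.0},
    {"llm_input": "[SQL-writer instruction] Plan: ... Schema: ...",
     "llm_output": "SELECT name FROM users WHERE ...;",
     "reward": 1.0},
]
# Treating each element as its own (input, output, reward) transition avoids
# having to decide how these two differently-prompted calls should be merged.
```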
Hope this answers your question!