
Can I integrate https://github.com/microsoft/llm-as-judge as the search reward for GRPO training?

Open johnson7788 opened this issue 1 month ago • 1 comments

Here is some pseudocode generated with OpenAI. Is it correct?

from agl import emit_reward, span

def search_tool(query):
    docs = real_search(query)

    # 1) Build the judge input
    judge_input = {
        "query": query,
        "docs": [{"title": d.title, "snippet": d.snippet} for d in docs[:5]],
    }

    # 2) Score with llm-as-judge (returns 1~10)
    score_1_10 = llm_as_judge(judge_input, rubric="relevance+coverage")
    reward = (score_1_10 - 1) / 9.0  # -> 0~1

    # 3) Emit the reward (step-level)
    emit_reward(name="search_quality", value=reward)

    return docs

If you speak Chinese, we can also talk on WeChat: johnsongzc

johnson7788 avatar Dec 04 '25 03:12 johnson7788

I think it makes sense, though I don't know for sure how llm_as_judge works.
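For the normalization step `(score_1_10 - 1) / 9.0` in particular, a defensive version may be worth considering, since an LLM judge can return an out-of-range or non-numeric score. This is only a minimal sketch: `normalize_judge_score` is a hypothetical helper (not part of agent-lightning or llm-as-judge), and treating an unparsable score as a reward of 0 is an assumption.

```python
import math


def normalize_judge_score(raw_score, low=1.0, high=10.0):
    """Map a judge score on [low, high] to a reward in [0, 1].

    Hypothetical helper; not part of agent-lightning or llm-as-judge.
    """
    try:
        score = float(raw_score)
    except (TypeError, ValueError):
        # Assumption: a score that cannot be parsed earns no reward.
        return 0.0
    if math.isnan(score):
        return 0.0
    # Clamp so an out-of-range score cannot push the reward outside [0, 1].
    score = min(max(score, low), high)
    return (score - low) / (high - low)


reward = normalize_judge_score("7")  # a judge that answers with the string "7"
```

Clamping keeps the reward bounded even when the judge ignores the rubric's scale, which keeps the rewards within a rollout group on a comparable scale.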

ultmaster avatar Dec 04 '25 06:12 ultmaster

You could apply an adapter that filters the spans so that the reward of the last empty span is broadcast to the previous spans.

# Import path assumed; adjust it to your agent-lightning version.
from agentlightning.adapter import TracerTraceToTriplet


class MergedRewardAdapter(TracerTraceToTriplet):
    """Custom adapter that merges a standalone reward span into the preceding action triplet."""

    def adapt(self, spans: list) -> list:
        # 1. Call the parent adapt() first to get the original list of triplets.
        # At this point the list may contain
        # [a triplet with content but no reward, a triplet with a reward but no content].
        triplets = super().adapt(spans)
        merged_triplets = []

        for triplet in triplets:
            # 2. Check whether the current triplet is "reward-only".
            is_reward_only = False
            response = getattr(triplet, "response", None)

            # The response counts as empty if it is None, an empty dict,
            # or a dict with no token_ids and no content.
            if isinstance(response, dict):
                if not response.get("token_ids") and not response.get("content"):
                    is_reward_only = True
            elif not response:
                is_reward_only = True

            # 3. This is a reward-only triplet and it actually carries a reward.
            if is_reward_only and getattr(triplet, "reward", None) is not None:
                if merged_triplets:
                    # Key step: assign this triplet's reward to the previous triplet
                    # in the list, i.e. fill in the reward for the action that had
                    # content but no reward.
                    try:
                        merged_triplets[-1].reward = triplet.reward
                    except (AttributeError, TypeError):
                        pass
            else:
                # 4. A normal triplet (with action content) is appended as-is.
                merged_triplets.append(triplet)

        return merged_triplets
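To see what the merge does without running agent-lightning, the same logic can be tried on stand-in objects. This is a sketch under stated assumptions: `FakeTriplet`, `is_reward_only`, and `merge_reward_only` are hypothetical names for illustration, and the real triplet type may have different fields.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class FakeTriplet:
    """Hypothetical stand-in for an adapter triplet (prompt, response, reward)."""
    prompt: Any
    response: Any
    reward: Optional[float] = None


def is_reward_only(triplet: FakeTriplet) -> bool:
    """Return True when the triplet has no response content."""
    response = getattr(triplet, "response", None)
    if isinstance(response, dict):
        return not response.get("token_ids") and not response.get("content")
    return not response


def merge_reward_only(triplets: List[FakeTriplet]) -> List[FakeTriplet]:
    """Fold each reward-only triplet into the triplet that precedes it."""
    merged: List[FakeTriplet] = []
    for triplet in triplets:
        if is_reward_only(triplet) and triplet.reward is not None:
            if merged:
                # Mutates the previous triplet in place.
                merged[-1].reward = triplet.reward
        else:
            merged.append(triplet)
    return merged


triplets = [
    FakeTriplet(prompt={"token_ids": [1, 2]}, response={"token_ids": [3, 4]}),
    FakeTriplet(prompt={}, response={}, reward=0.67),  # reward-only
    FakeTriplet(prompt={"token_ids": [5]}, response={"token_ids": [6]}),
]
merged = merge_reward_only(triplets)
print([(t.response, t.reward) for t in merged])
```

Note that in this sketch, as in the adapter above, the reward is assigned only to the immediately preceding triplet. If one judge score should cover several earlier actions, the loop would need to walk further back through the merged list.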

yanfei-zhang-95 avatar Dec 07 '25 03:12 yanfei-zhang-95