Can I use https://github.com/microsoft/llm-as-judge as the search reward for GRPO training by wiring it into the search tool?
Here is some rough pseudocode (OpenAI-style, not tested) — is it correct?
from agl import emit_reward, span

def search_tool(query):
    docs = real_search(query)
    # 1) Build the judge input
    judge_input = {
        "query": query,
        "docs": [{"title": d.title, "snippet": d.snippet} for d in docs[:5]],
    }
    # 2) Call llm-as-judge for a score (returns 1~10)
    score_1_10 = llm_as_judge(judge_input, rubric="relevance+coverage")
    reward = (score_1_10 - 1) / 9.0  # -> 0~1
    # 3) Emit the reward (step-level)
    emit_reward(name="search_quality", value=reward)
    return docs
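One robustness note on the normalization step: LLM judges occasionally return scores outside the stated range, so it can be worth clamping before mapping to [0, 1]. A minimal sketch (`normalize_judge_score` is a hypothetical helper, not part of llm-as-judge):

```python
def normalize_judge_score(score, lo=1.0, hi=10.0):
    """Clamp a judge score into [lo, hi], then map it linearly to [0, 1]."""
    score = max(lo, min(hi, score))
    return (score - lo) / (hi - lo)

# Endpoints map as expected; out-of-range scores are clamped.
print(normalize_judge_score(1))   # -> 0.0
print(normalize_judge_score(10))  # -> 1.0
print(normalize_judge_score(12))  # -> 1.0 (clamped)
```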
If you're in China, we can also chat on WeChat: johnsongzc
I think it makes sense, though I don't know for sure how llm_as_judge works internally.
You could apply an adapter that filters the spans so that the reward on the trailing empty span gets broadcast to the preceding action spans.
class MergedRewardAdapter(TracerTraceToTriplet):
    """Custom adapter that merges standalone reward spans into the preceding action triplet."""

    def adapt(self, spans: list) -> list:
        # 1. Call the parent adapt() to get the raw triplet list. At this point it may
        #    contain [triplet with content but no reward, triplet with reward but no content].
        triplets = super().adapt(spans)
        merged_triplets = []
        for triplet in triplets:
            # 2. Decide whether this triplet is "reward-only".
            is_reward_only = False
            response = getattr(triplet, "response", None)
            # The response may be an empty dict, have empty token_ids, or be None.
            if isinstance(response, dict):
                if not response.get("token_ids") and not response.get("content"):
                    is_reward_only = True
            elif not response:
                is_reward_only = True
            # 3. If it is reward-only and actually carries a score...
            if is_reward_only and getattr(triplet, "reward", None) is not None:
                if merged_triplets:
                    # Key step: assign this reward to the previous triplet in the list,
                    # i.e. fill in the score for the action that had content but no reward.
                    try:
                        merged_triplets[-1].reward = triplet.reward
                    except (AttributeError, TypeError):
                        pass
            else:
                # 4. A normal triplet (has action content): keep it as-is.
                merged_triplets.append(triplet)
        return merged_triplets
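If you want to sanity-check the merging behavior without Agent Lightning installed, here is a standalone sketch of the same logic, using `SimpleNamespace` stand-ins for triplets (`merge_reward_only` is a hypothetical helper mirroring the loop body in `adapt()` above, not a library function):

```python
from types import SimpleNamespace

def merge_reward_only(triplets):
    """Fold reward-only triplets into the preceding action triplet."""
    merged = []
    for t in triplets:
        resp = getattr(t, "response", None)
        if isinstance(resp, dict):
            reward_only = not resp.get("token_ids") and not resp.get("content")
        else:
            reward_only = not resp
        if reward_only and getattr(t, "reward", None) is not None:
            if merged:
                merged[-1].reward = t.reward  # broadcast the score backwards
        else:
            merged.append(t)
    return merged

# Example: an action triplet with no reward, followed by a reward-only span.
action = SimpleNamespace(response={"token_ids": [1, 2], "content": "search(...)"}, reward=None)
reward_span = SimpleNamespace(response={}, reward=0.8)
result = merge_reward_only([action, reward_span])
print(len(result))        # -> 1
print(result[0].reward)   # -> 0.8
```

The reward-only span is consumed and its score lands on the preceding action, which is the broadcast behavior the adapter is meant to implement.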