
rubric_based_tool_use_quality_v1 is flaky when the test case has multiple turns

Open miyannishar opened this issue 1 month ago • 5 comments

Describe the bug In ADK Eval, when a test case has multiple turns, rubric_based_tool_use_quality_v1 seems very flaky. It is also not well documented how, when the *.test.json has multiple turns, we can differentiate the rubrics in test_config.json for tool calls made in different turns.

To Reproduce In an agentic setup, create an eval test case with multiple turns like this:

      "conversation": [
        {
          "user_content": {
            "role": "user",
            "parts": [
              {
                "text": "Do ABC"
              }
            ]
          }
        },
        {
          "user_content": {
            "role": "user",
            "parts": [
              {
                "text": "Do XYZ"
              }
            ]
          }
        }
      ]

such that both messages call a set of tools, and add rubric_based_tool_use_quality_v1 to the eval config. The score becomes very random, and it is unclear whether we need a different set of rubrics for different turns or not.

Expected behavior Rubrics should be differentiable per turn.

Desktop (please complete the following information):

  • OS: mac
  • Python version(python -V): 3.12
  • ADK version(pip show google-adk): 1.18

Model Information:

  • Are you using LiteLLM: Yes

miyannishar avatar Nov 18 '25 05:11 miyannishar

@miyannishar you can actually differentiate rubrics per turn. Instead of putting all rubrics in conversation_level_rubrics, add turn-specific rubrics directly to each invocation using the rubrics field: in your test JSON, add "rubrics": [{"rubric_id": "...", "rubric_content": {"text_property": "..."}}] to each invocation object in the conversation array (see the sketch below). Keep only the common rubrics (those applicable to all turns) in conversation_level_rubrics. That way each turn is evaluated with its own specific rubrics plus the common ones. Thanks!
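
A minimal sketch of that layout, based on the field names described above; the rubric ids and rubric text are illustrative placeholders, and the exact schema should be checked against your ADK version:

      "conversation": [
        {
          "user_content": {
            "role": "user",
            "parts": [{"text": "Do ABC"}]
          },
          "rubrics": [
            {
              "rubric_id": "turn1_abc_tools",
              "rubric_content": {"text_property": "For the 'Do ABC' request, the agent calls the expected tools with correct arguments."}
            }
          ]
        },
        {
          "user_content": {
            "role": "user",
            "parts": [{"text": "Do XYZ"}]
          },
          "rubrics": [
            {
              "rubric_id": "turn2_xyz_tools",
              "rubric_content": {"text_property": "For the 'Do XYZ' request, the agent calls the expected tools with correct arguments."}
            }
          ]
        }
      ],
      "conversation_level_rubrics": [
        {
          "rubric_id": "common_tool_use",
          "rubric_content": {"text_property": "The agent never calls a tool with made-up arguments."}
        }
      ]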

surajksharma07 avatar Nov 18 '25 09:11 surajksharma07

Can you point me to the code that implements rubric use for multi-turn cases? I implemented the per-invocation rubrics as suggested, and the results are still very inconsistent.

miyannishar avatar Nov 18 '25 18:11 miyannishar

The current rubric_based_tool_use_quality_v1 implementation doesn't support per-invocation rubrics: all rubrics from the metric's criterion are evaluated against all turns (see rubric_based_tool_use_quality_v1.py:193-194). The inconsistency comes from LLM sampling variance (5 samples per turn) and mean aggregation across turns. Workaround: encode turn-specific context in your rubric text (e.g., "In the first turn, did the agent call tool X?"); a sketch follows below. Reference: src/google/adk/evaluation/rubric_based_tool_use_quality_v1.py and rubric_based_evaluator.py.
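
For example, the rubric list could spell out the turn explicitly in the rubric text itself; the ids, tool names, and wording below are only illustrative:

      "conversation_level_rubrics": [
        {
          "rubric_id": "turn1_tool_check",
          "rubric_content": {"text_property": "In the first turn ('Do ABC'), did the agent call tool X with the expected arguments?"}
        },
        {
          "rubric_id": "turn2_tool_check",
          "rubric_content": {"text_property": "In the second turn ('Do XYZ'), did the agent call tool Y with the expected arguments?"}
        }
      ]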

surajksharma07 avatar Nov 19 '25 13:11 surajksharma07

@surajksharma07

Approaches tried (none worked reliably):

  • Single comprehensive rubric
  • Multiple turn-specific rubrics
  • Individual tool-call rubrics (7 separate rubrics)
  • Explicit scoring instructions ("Score 1.0 if present, 0.0 if absent")
  • Different rubric_id naming conventions
  • Simplified vs. detailed rubric text
  • Exact user input matching vs. partial matching

Root cause: The evaluation framework’s rubric matching logic appears flawed.

miyannishar avatar Nov 25 '25 06:11 miyannishar

Thanks for the detailed testing - you're right that this goes beyond configuration.

The issue is that format_auto_rater_prompt (lines 193-194) uses all rubrics from the criterion for every turn without filtering by invocation context, and the 5-sample LLM-as-judge with mean aggregation amplifies the inconsistency when rubrics don't match the turn: a rubric written for turn 2 is also scored against turn 1, where the judge samples disagree, so the aggregated mean swings from run to run.

Best immediate workaround:

Split multi-turn tests into separate single-turn cases (sketched below), or use very explicit turn references in rubric text ("Turn 1 specifically" vs "Turn 2 specifically").
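
Reusing the reproduction above, the split would look roughly like this, with each case keeping whatever surrounding fields (ids, session input) your existing *.test.json already has and carrying only the rubrics relevant to its own request.

First case (only the "Do ABC" request):

      "conversation": [
        {
          "user_content": {
            "role": "user",
            "parts": [{"text": "Do ABC"}]
          }
        }
      ]

Second case (only the "Do XYZ" request):

      "conversation": [
        {
          "user_content": {
            "role": "user",
            "parts": [{"text": "Do XYZ"}]
          }
        }
      ]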

I'll investigate if we can add per-turn rubric filtering support to make this more robust for multi-agent workflows.

surajksharma07 avatar Dec 01 '25 08:12 surajksharma07