feat(evaluation): add core evaluation framework
This PR introduces an evaluation framework for testing and measuring AI agent performance. It supports both algorithmic and LLM-as-Judge evaluation methods, with built-in metrics for response quality, tool usage, safety, and hallucination detection.
> [!TIP]
> This PR uses atomic commits organized by feature. For the best review experience, I suggest reviewing commit-by-commit to follow the logical progression of the implementation.
> [!NOTE]
> I follow the Conventional Commits specification for a structured commit history.
Features:
- Evaluation Methods
  - Algorithmic evaluators (ROUGE-1 scoring, exact matching)
  - LLM-as-Judge with customizable rubrics
  - Multi-sample evaluation
- 8 Metrics
  - Response quality: match score, semantic matching, coherence, rubric-based
  - Tool usage: trajectory scoring, rubric-based quality
  - Safety & quality: harmlessness, hallucination detection
- Flexible Storage (see the sketch after this list)
  - In-memory storage for development/testing
  - File-based storage with JSON persistence for CI/CD
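
The Usage snippet below assumes an `evalStorage` value already exists. As a hedged sketch of how a caller might choose between the two backends (the `Storage` interface name and the `NewInMemoryStorage`/`NewFileStorage` constructors are illustrative assumptions, not necessarily the final API):

```go
// Pick a storage backend for evaluation results.
// NOTE: the constructor and interface names below are assumptions used
// for illustration; see the evaluation package in this PR for the real API.
var evalStorage evaluation.Storage
if os.Getenv("CI") != "" {
    // File-based storage persists results as JSON, handy for CI/CD runs.
    evalStorage = evaluation.NewFileStorage("./eval-results")
} else {
    // In-memory storage keeps results only for the current process,
    // which is enough for local development and tests.
    evalStorage = evaluation.NewInMemoryStorage()
}
```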
Usage
```go
// Create evaluation runner
evalRunner := evaluation.NewRunner(evaluation.RunnerConfig{
    AgentRunner:        agentRunner,
    Storage:            evalStorage,
    SessionService:     sessionService,
    AppName:            "my-app",
    RateLimitDelay:     6 * time.Second,
    MaxConcurrentEvals: 10,
})

// Define evaluation criteria
config := &evaluation.EvalConfig{
    JudgeLLM:   judgeLLM,
    JudgeModel: "gemini-2.5-flash",
    Criteria: []evaluation.Criterion{
        &evaluation.Threshold{
            MinScore:   0.8,
            MetricType: evaluation.MetricResponseMatch,
        },
        &evaluation.LLMAsJudgeCriterion{
            Threshold: &evaluation.Threshold{
                MinScore:   0.9,
                MetricType: evaluation.MetricSafety,
            },
            MetricType: evaluation.MetricSafety,
            JudgeModel: "gemini-2.5-flash",
        },
    },
}

// Run evaluation
result, err := evalRunner.RunEvalSet(ctx, evalSet, config)
```
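
The snippet above references an `evalSet` that is not shown and leaves the returned values unhandled. A hedged sketch of the surrounding glue code (the `EvalSet`/`EvalCase` shapes and the `Passed` result field are illustrative assumptions, not necessarily the final API):

```go
// Build a minimal eval set with a single case.
// NOTE: type and field names are assumptions used for illustration;
// see the evaluation package in this PR for the real definitions.
evalSet := &evaluation.EvalSet{
    Name: "smoke-test",
    Cases: []evaluation.EvalCase{
        {
            Prompt:           "What is the capital of France?",
            ExpectedResponse: "Paris",
        },
    },
}

// Check the outcome of RunEvalSet from the snippet above.
if err != nil {
    log.Fatalf("evaluation run failed: %v", err)
}
if !result.Passed {
    log.Fatalf("evaluation did not meet the configured score thresholds")
}
```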
Testing
Two examples are provided to demonstrate the features:
- examples/evaluation/basic/ - Simple introduction with 2 evaluators;
- examples/evaluation/comprehensive/ - Full-featured example covering all 8 metrics.
Run examples:
```bash
export GOOGLE_API_KEY=your_key
cd examples/evaluation/basic
go run main.go
cd ../comprehensive
go run main.go
```