feat(evaluation): add core evaluation framework
This PR introduces an evaluation framework for testing and measuring AI agent performance. It supports both algorithmic and LLM-as-Judge evaluation methods, with built-in metrics for response quality, tool usage, safety, and hallucination detection.
> [!TIP]
> This PR uses atomic commits organized by feature. For the best review experience, I suggest reviewing commit-by-commit to follow the logical progression of the implementation.
> [!NOTE]
> I follow the Conventional Commits specification for a structured commit history.
Features:
- Evaluation Methods
  - Algorithmic evaluators (ROUGE-1 scoring, exact matching)
  - LLM-as-Judge with customizable rubrics
  - Multi-sample evaluation
- 8 Metrics
  - Response quality: match score, semantic matching, coherence, rubric-based
  - Tool usage: trajectory scoring, rubric-based quality
  - Safety & quality: harmlessness, hallucination detection
- Flexible Storage (see the sketch after this list)
  - In-memory storage for development/testing
  - File-based storage with JSON persistence for CI/CD
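
The Usage snippet below assumes an `evalStorage` value already exists. As a hedged sketch of how a caller might choose between the two backends (the `Storage` interface name and the `NewInMemoryStorage`/`NewFileStorage` constructors are illustrative assumptions, not necessarily the final API):

```go
// Pick a storage backend for evaluation results.
// NOTE: the constructor and interface names below are assumptions used
// for illustration; see the evaluation package in this PR for the real API.
var evalStorage evaluation.Storage
if os.Getenv("CI") != "" {
    // File-based storage persists results as JSON, handy for CI/CD runs.
    evalStorage = evaluation.NewFileStorage("./eval-results")
} else {
    // In-memory storage keeps results only for the current process,
    // which is enough for local development and tests.
    evalStorage = evaluation.NewInMemoryStorage()
}
```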
Usage
```go
// Create evaluation runner
evalRunner := evaluation.NewRunner(evaluation.RunnerConfig{
    AgentRunner:        agentRunner,
    Storage:            evalStorage,
    SessionService:     sessionService,
    AppName:            "my-app",
    RateLimitDelay:     6 * time.Second,
    MaxConcurrentEvals: 10,
})

// Define evaluation criteria
config := &evaluation.EvalConfig{
    JudgeLLM:   judgeLLM,
    JudgeModel: "gemini-2.5-flash",
    Criteria: []evaluation.Criterion{
        &evaluation.Threshold{
            MinScore:   0.8,
            MetricType: evaluation.MetricResponseMatch,
        },
        &evaluation.LLMAsJudgeCriterion{
            Threshold: &evaluation.Threshold{
                MinScore:   0.9,
                MetricType: evaluation.MetricSafety,
            },
            MetricType: evaluation.MetricSafety,
            JudgeModel: "gemini-2.5-flash",
        },
    },
}

// Run evaluation
result, err := evalRunner.RunEvalSet(ctx, evalSet, config)
```
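
The snippet above references an `evalSet` that is not shown and leaves the returned values unhandled. A hedged sketch of the surrounding glue code (the `EvalSet`/`EvalCase` shapes and the `Passed` result field are illustrative assumptions, not necessarily the final API):

```go
// Build a minimal eval set with a single case.
// NOTE: type and field names are assumptions used for illustration;
// see the evaluation package in this PR for the real definitions.
evalSet := &evaluation.EvalSet{
    Name: "smoke-test",
    Cases: []evaluation.EvalCase{
        {
            Prompt:           "What is the capital of France?",
            ExpectedResponse: "Paris",
        },
    },
}

// Check the outcome of RunEvalSet from the snippet above.
if err != nil {
    log.Fatalf("evaluation run failed: %v", err)
}
if !result.Passed {
    log.Fatalf("evaluation did not meet the configured score thresholds")
}
```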
Testing
Two examples are provided to demonstrate the features:
- examples/evaluation/basic/ - Simple introduction with 2 evaluators;
- examples/evaluation/comprehensive/ - Full-featured example covering all 8 metrics.
Run examples:
```bash
export GOOGLE_API_KEY=your_key
cd examples/evaluation/basic
go run main.go
cd ../comprehensive
go run main.go
```