llama-stack
[Evals API][9/n] SimpleQA evals
Continuation of https://github.com/meta-llama/llama-stack/pull/352
TL;DR
- Implement OpenAI's SimpleQA benchmark as a ScoringFn (reference)
[RFC]
- Option 1: SimpleQAScoringFn: Move each benchmark eval into a separate scoring function with its own context.
- Option 2 (current): Single LLMAsJudgeScoring for SimpleQA.
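
To make Option 2 concrete, here is a minimal sketch of a single generic judge-based scorer where SimpleQA is only a prompt template plus a label map. It is not the code in this PR: `JudgeScoringSketch`, `SIMPLEQA_JUDGE_TEMPLATE`, `SIMPLEQA_LABELS`, the row keys, and the `judge` callable are all hypothetical names chosen for illustration, and the template is a shortened stand-in for the real SimpleQA grader prompt. The A/B/C to CORRECT/INCORRECT/NOT_ATTEMPTED mapping and the fallback for unparseable replies are assumptions about how the grader output is interpreted.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical, shortened stand-in for the SimpleQA grader prompt.
SIMPLEQA_JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Gold target: {expected_answer}\n"
    "Predicted answer: {generated_answer}\n"
    "Reply with A (CORRECT), B (INCORRECT), or C (NOT_ATTEMPTED)."
)

# Assumed mapping from the judge's single-letter reply to a grade.
SIMPLEQA_LABELS = {"A": "CORRECT", "B": "INCORRECT", "C": "NOT_ATTEMPTED"}


@dataclass
class JudgeScoringSketch:
    """Generic judge-based scorer; a benchmark is just template + labels."""

    judge: Callable[[str], str]  # hypothetical: takes a prompt, returns the judge's reply
    prompt_template: str
    labels: Dict[str, str]
    fallback: str = "NOT_ATTEMPTED"  # assumed grade when the reply cannot be parsed

    def score_row(self, row: Dict[str, str]) -> str:
        # Fill the template from the row, ask the judge, and parse the first
        # standalone label letter found in its reply.
        prompt = self.prompt_template.format(**row)
        reply = self.judge(prompt)
        pattern = r"\b(" + "|".join(re.escape(k) for k in self.labels) + r")\b"
        match = re.search(pattern, reply)
        return self.labels[match.group(1)] if match else self.fallback

    def aggregate(self, grades: List[str]) -> Dict[str, float]:
        # Fraction of rows that received each grade.
        names = sorted(set(self.labels.values()) | {self.fallback})
        if not grades:
            return {name: 0.0 for name in names}
        return {name: grades.count(name) / len(grades) for name in names}
```

Under this shape, adding another judge-graded benchmark means supplying a new template and label map rather than a new class, which is the trade-off against Option 1's per-benchmark scoring functions.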
Test
- Test client app in: https://github.com/meta-llama/llama-stack-apps/pull/105