llama-stack

[Evals API][9/n] SimpleQA evals

Open yanxi0830 opened this issue 1 year ago • 0 comments

Continuation of https://github.com/meta-llama/llama-stack/pull/352

TL;DR

Implement OpenAI's SimpleQA benchmark as a ScoringFn (reference)

[RFC]

Option 1: SimpleQAScoringFn: Move each benchmark eval into a separate scoring function with its own context.
Option 2 (current): Single LLMAsJudgeScoring for SimpleQA.
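
To make Option 2 concrete, here is a minimal sketch of a generic LLM-as-judge scorer where the SimpleQA-specific part is only the prompt template and the grade mapping. Everything in it is illustrative and is not the llama-stack API: `JUDGE_TEMPLATE`, `score_row`, `aggregate`, and `fake_judge` are hypothetical names, the prompt is a shortened stand-in for the real SimpleQA grader prompt, and the A/B/C letter scheme (correct / incorrect / not attempted) is assumed to mirror SimpleQA's three-way grading. Treating an unparseable judge reply as "not attempted" is also an assumption of this sketch.

```python
import re
from typing import Callable, Dict, List

# Hypothetical, shortened judge prompt; the real SimpleQA grader prompt is longer.
JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Gold target: {target}\n"
    "Predicted answer: {predicted}\n"
    "Grade the predicted answer as A (CORRECT), B (INCORRECT), "
    "or C (NOT_ATTEMPTED). Reply with a single letter."
)

# Assumed mapping from judge letter to grade label.
GRADE_LABELS = {"A": "correct", "B": "incorrect", "C": "not_attempted"}


def score_row(row: Dict[str, str], judge: Callable[[str], str]) -> str:
    """Ask the judge to grade one row and map its reply to a grade label."""
    prompt = JUDGE_TEMPLATE.format(**row)
    reply = judge(prompt)
    match = re.search(r"\b([ABC])\b", reply)
    # Assumption: an unparseable reply counts as "not attempted".
    letter = match.group(1) if match else "C"
    return GRADE_LABELS[letter]


def aggregate(grades: List[str]) -> Dict[str, float]:
    """Return the fraction of rows that received each grade label."""
    total = len(grades)
    return {
        label: (grades.count(label) / total if total else 0.0)
        for label in GRADE_LABELS.values()
    }


def fake_judge(prompt: str) -> str:
    """Stand-in for a real LLM call: exact match between target and prediction."""
    fields = dict(line.split(": ", 1) for line in prompt.splitlines()[:3])
    if not fields["Predicted answer"].strip():
        return "C"
    return "A" if fields["Gold target"] == fields["Predicted answer"] else "B"


if __name__ == "__main__":
    rows = [
        {"question": "Capital of France?", "target": "Paris", "predicted": "Paris"},
        {"question": "Capital of Peru?", "target": "Lima", "predicted": "Quito"},
        {"question": "Capital of Chad?", "target": "N'Djamena", "predicted": ""},
    ]
    grades = [score_row(row, fake_judge) for row in rows]
    print(grades)
    print(aggregate(grades))
```

Under Option 1, the template, the letter parsing, and the aggregation would instead live together in a dedicated SimpleQA scoring function, so each benchmark carries its own context rather than being passed into a shared judge scorer.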

Test

Test client app in: https://github.com/meta-llama/llama-stack-apps/pull/105

Nov 01 '24 07:11 yanxi0830