llama-stack
[Evals API][6/n] meta-reference llm as judge, registration for ScoringFnDefs
Continuation of https://github.com/meta-llama/llama-stack/pull/323.
TL;DR
- Add LLM-as-judge meta-reference impl; the judge runs through the Llama Stack inference_api.
- Move ScoringFnDefs to *.json-based files for easier registration.
- Support dynamically registering ScoringFnDefs with an LLMAsJudgeContext.
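As an illustration of what one of those *.json registration files could contain, here is a hypothetical fragment whose field names mirror the ScoringFnDefWithProvider fields used in the registration example in this PR; the exact on-disk schema is an assumption, not taken from the source:

```json
{
  "identifier": "meta-reference::llm_as_judge_8b_correctness",
  "description": "LLM-as-judge correctness scoring function",
  "parameters": [],
  "return_type": {"type": "number"},
  "context": {
    "prompt_template": "Output a number between 0 and 10. Your answer must match the format \n Number: <answer>",
    "judge_model": "Llama3.1-8B-Instruct",
    "judge_score_regex": ["Number: (\\d+)"]
  }
}
```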
Support Scoring Using a Judge
response = await scoring_impl.score_batch(
    dataset_id=response[0].identifier,
    scoring_functions=[
        "meta-reference::llm_as_judge_8b_correctness",
    ],
)
Support Full Generate + Score using a Judge
response = await eval_impl.evaluate_batch(
    dataset_id=response[0].identifier,
    candidate=ModelCandidate(
        model="Llama3.2-1B-Instruct",
        sampling_params=SamplingParams(),
    ),
    scoring_functions=[
        "meta-reference::subset_of",
        "meta-reference::llm_as_judge_8b_correctness",
    ],
)
Support Register A Judge using ScoringFnDef
# register the scoring function
await scoring_functions_impl.register_scoring_function(
    ScoringFnDefWithProvider(
        identifier="meta-reference::llm_as_judge_8b_random",
        description="LLM As Judge Scoring Function",
        parameters=[],
        return_type=NumberType(),
        context=LLMAsJudgeContext(
            prompt_template="""
Output a number between 0 and 10. Your answer must match the format \n Number: <answer>
""",
            judge_model="Llama3.1-8B-Instruct",
            judge_score_regex=[r"Number: (\d+)"],
        ),
        provider_id="test-meta",
    )
)
scoring_functions = await scoring_functions_impl.list_scoring_functions()
Test
Scoring using the 3.1-8B judge:
PROVIDER_ID=test-meta PROVIDER_CONFIG=llama_stack/providers/tests/scoring/provider_config_example.yaml pytest -s llama_stack/providers/tests/scoring/test_scoring.py --tb=short --disable-warnings
Generate using 3.2-1B, score using the 3.1-8B judge:
PROVIDER_ID=test-meta PROVIDER_CONFIG=llama_stack/providers/tests/eval/provider_config_example.yaml pytest -s llama_stack/providers/tests/eval/test_eval.py --tb=short --disable-warnings
Register a random judge function:
results={
    'meta-reference::llm_as_judge_8b_random': ScoringResult(
        score_rows=[
            {'score': 5, 'judge_feedback': 'Number: 5'},
            {'score': 5, 'judge_feedback': 'Number: 5'},
            {'score': 5, 'judge_feedback': 'Number: 5'},
            {'score': 5, 'judge_feedback': 'Number: 5'},
            {'score': 7, 'judge_feedback': 'Number: 7'},
        ],
        aggregated_results={'average': 5.4},
    )
}
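The 'average' value in aggregated_results is consistent with a plain mean over score_rows: (5 + 5 + 5 + 5 + 7) / 5 = 5.4. A small sketch, assuming that is how the aggregator computes it (the function name is hypothetical):

```python
def aggregate_average(score_rows: list[dict]) -> float:
    """Assumed behavior of the 'average' aggregation: mean of the
    'score' field across all rows."""
    return sum(row["score"] for row in score_rows) / len(score_rows)


rows = [{"score": s} for s in (5, 5, 5, 5, 7)]
print(aggregate_average(rows))  # -> 5.4
```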