# Evals API MVP
## DevX Flow
### Step 1. Register Eval Dataset

```bash
python -m llama_stack.apis.datasets.client
```
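Under the hood this registers the dataset against a running distribution through the `/datasets` endpoints. A minimal sketch of what such a call could look like over plain HTTP is below; the server address and the request body fields (`name`, `url`) are illustrative assumptions, not the exact schema.

```python
import requests

BASE_URL = "http://localhost:5000"  # assumed distribution address

# Hypothetical request body: register a dataset from a URL / huggingface dataset.
resp = requests.post(
    f"{BASE_URL}/datasets/create",
    json={
        "name": "my_eval_dataset",                    # hypothetical field name
        "url": "https://huggingface.co/datasets/...", # or a custom file upload / url
    },
)
resp.raise_for_status()
print(resp.json())
```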
### Step 2. Run Eval Scorer

```bash
python -m llama_stack.apis.evals.client
```

- (benchmark) run the full preprocess -> generation -> postprocess -> score eval task flow
- (evaluate score only) run the scorer only on a prepared eval dataset with `expected_answer` and `generated_answer` columns (see the example rows below)
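For the score-only path, the prepared dataset already carries the model output next to the reference answer, so only the scorer runs. A minimal sketch of what such rows could look like (matching the schema further down):

```python
# Two rows of a prepared eval dataset for the score-only flow.
prepared_rows = [
    {
        "input_query": "What is the capital of France?",       # optional column
        "generated_answer": "Paris is the capital of France.",
        "expected_answer": "Paris",
    },
    {
        "input_query": "2 + 2 = ?",
        "generated_answer": "4",
        "expected_answer": ["4", "four"],                       # multiple references allowed
    },
]
```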
### LLM As Judge

- Scorer for Braintrust `AnswerCorrectness()`
- Scorer using a judge model hosted on a Llama Stack distribution (via `inference_api`); a rough sketch follows below
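The LLM-as-judge idea, roughly sketched: ask a judge model served through the distribution's inference API whether the generated answer matches the expected one. `BaseScorer`'s real interface is not shown in this summary, so the method name `score_sample`, the prompt, and the exact shape of the inference call are illustrative assumptions.

```python
# Illustrative sketch only; not the scorer interface from this PR.
JUDGE_PROMPT = """You are grading an answer.
Question: {input_query}
Expected answer: {expected_answer}
Generated answer: {generated_answer}
Reply with a single word: CORRECT or INCORRECT."""


class LLMJudgeScorer:
    def __init__(self, inference_api, judge_model: str):
        self.inference_api = inference_api
        self.judge_model = judge_model

    async def score_sample(self, sample) -> float:
        # Fill the judge prompt from one ScorerInputSample-shaped row.
        prompt = JUDGE_PROMPT.format(
            input_query=sample.input_query,
            expected_answer=sample.expected_answer,
            generated_answer=sample.generated_answer,
        )
        # Simplified inference call; the real message/response types come from
        # the inference API.
        response = await self.inference_api.chat_completion(
            model=self.judge_model,
            messages=[{"role": "user", "content": prompt}],
        )
        verdict = response.completion_message.content.strip().upper()
        return 1.0 if verdict.startswith("CORRECT") else 0.0
```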
## Eval Dataset Template Schema

```python
from typing import List, Optional, Union

from pydantic import BaseModel

# json_schema_type, TokenLogProbs, and DatasetSample come from the corresponding
# llama-stack / llama-models API modules introduced or used in this PR.


@json_schema_type
class PostprocessedGeneration(BaseModel):
    completion_message: str
    logprobs: Optional[List[TokenLogProbs]] = None


@json_schema_type
class ScorerInputSample(DatasetSample):
    """
    A dataset is required to have the following columns to be used for scoring:
    - generated_answer: str
    - expected_answer: Union[str, List[str]]
    - (optional) input_query: str
    - (optional) generation_output: PostprocessedGeneration
    """

    generated_answer: str
    expected_answer: Union[str, List[str]]
    input_query: Optional[str] = None
    generation_output: Optional[PostprocessedGeneration] = None
```
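For illustration, a prepared row maps onto this schema as follows (assuming `DatasetSample` adds no additional required fields):

```python
# Hypothetical construction of a single sample; field names match the schema above.
sample = ScorerInputSample(
    input_query="What is the capital of France?",
    generated_answer="Paris is the capital of France.",
    expected_answer="Paris",
)
```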
## High-Level Changes

1. New endpoints API added

- `/datasets` for registering/deleting datasets
  - `/datasets/create` --> add new datasets to the distribution; supports custom file upload / url / huggingface datasets
  - `/datasets/get`
  - `/datasets/delete`
  - `/datasets/list`
- `/evals` for running evaluation tasks
  - `/evals/run_eval_task` --> run the full eval, including preprocessing -> generation -> postprocessing -> scoring
  - `/evals/run_scorer` --> run scoring only (a sketch of both calls follows below)
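A hypothetical sketch of hitting the two `/evals` endpoints directly; the server address and request body fields are illustrative assumptions, not the exact schema.

```python
import requests

BASE_URL = "http://localhost:5000"  # assumed distribution address

# Full flow: preprocessing -> generation -> postprocessing -> scoring
requests.post(
    f"{BASE_URL}/evals/run_eval_task",
    json={"eval_task": "mmlu_pro", "model": "Llama3.1-8B-Instruct"},  # hypothetical fields
)

# Score-only flow over an already-prepared dataset
requests.post(
    f"{BASE_URL}/evals/run_scorer",
    json={"dataset": "my_prepared_dataset", "scorers": ["answer_correctness"]},  # hypothetical fields
)
```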
2. New datastructures added

- `Registry`: for maintaining datasets, scorers, and processors
- `BaseScorer`: evaluation methods for scoring
- `BaseDataset`: supports custom datasets / huggingface datasets
- `BaseGeneratorProcessor`: performs preprocessing/postprocessing before/after generation
- `BaseGenerator`: performs inference generation
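A minimal sketch of how these abstractions could fit together; the actual base class signatures live in the PR and may differ from this illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class BaseScorer(ABC):
    """Scores a batch of ScorerInputSample rows and aggregates the results."""

    @abstractmethod
    def score_sample(self, sample: "ScorerInputSample") -> float: ...

    def score(self, samples: List["ScorerInputSample"]) -> Dict[str, float]:
        scores = [self.score_sample(s) for s in samples]
        return {"accuracy": sum(scores) / max(len(scores), 1)}


class Registry:
    """Keeps named scorers/datasets/processors so eval tasks can look them up."""

    def __init__(self) -> None:
        self._entries: Dict[str, Type] = {}

    def register(self, name: str, cls: Type) -> None:
        self._entries[name] = cls

    def get(self, name: str) -> Type:
        return self._entries[name]
```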
## (Experimental) Eleuther Harness

- integrate with the Eleuther Eval Harness
- add a custom task for the Eleuther eval harness on `mmlu_pro` (e.g. as a recipe)
- dummy `loglikelihood` outputs for inference (see the sketch below)
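One way to bridge the harness to a chat-style inference API is to return dummy loglikelihood values, since the inference endpoint does not expose token logprobs for arbitrary continuations. The method names below follow a recent lm-evaluation-harness release and are illustrative, not code from this PR.

```python
# Illustrative adapter only: return dummy loglikelihoods, generate via the
# distribution's inference API.
from lm_eval.api.model import LM


class LlamaStackLM(LM):
    def __init__(self, inference_api, model: str):
        super().__init__()
        self.inference_api = inference_api
        self.model = model

    def loglikelihood(self, requests):
        # Dummy outputs: the harness expects (logprob, is_greedy) per request.
        return [(0.0, False) for _ in requests]

    def loglikelihood_rolling(self, requests):
        return [0.0 for _ in requests]

    def generate_until(self, requests):
        # Real generation would go through self.inference_api here.
        return ["" for _ in requests]
```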
---

Not in this PR:

- 🚧 jobs API for background eval job scheduling
- 🚧 batch inference
- 🚧 persist intermediate datasets during `run_eval_task`