
Evals API MVP

yanxi0830 opened this issue on Oct 10, 2024

DevX Flow

Step 1. Register Eval Dataset
python -m llama_stack.apis.datasets.client
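Under the hood, the client hits the /datasets routes introduced below. A minimal sketch of the equivalent raw HTTP call; the port and the request body fields are assumptions, not the final schema:

import requests

BASE_URL = "http://localhost:5000"  # assumed local distribution endpoint

dataset_def = {
    "name": "my_eval_dataset",                # hypothetical dataset id
    "url": "https://example.com/eval.jsonl",  # or a custom file upload / HF dataset
    "columns": ["input_query", "expected_answer"],
}

resp = requests.post(f"{BASE_URL}/datasets/create", json={"dataset": dataset_def})
resp.raise_for_status()
print(resp.json())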
Step 2. Run Eval Scorer
python -m llama_stack.apis.evals.client
  • (benchmark) run the full preprocess -> generation -> postprocess -> score eval task flow
  • (evaluate score only) run the scorer only on a prepared eval dataset with expected_answer and generated_answer columns (see the sketch below)
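A rough sketch of what the scorer-only path amounts to as a raw HTTP call; the field names and port are assumptions for illustration:

import requests

BASE_URL = "http://localhost:5000"  # assumed local distribution endpoint

# The dataset must already contain expected_answer and generated_answer columns.
payload = {
    "dataset_name": "my_prepared_eval_dataset",  # hypothetical registered dataset
    "scorer_names": ["accuracy"],                # hypothetical scorer id
}

resp = requests.post(f"{BASE_URL}/evals/run_scorer", json=payload)
resp.raise_for_status()
print(resp.json())  # e.g. aggregate scores per scorer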
LLM As Judge
  • Scorer for braintrust AnswerCorrectness()

  • Scorer using a judge model hosted on a Llama Stack distribution (via inference_api); see the sketch below
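A minimal sketch of what an LLM-as-judge scorer could look like on top of the proposed BaseScorer and ScorerInputSample; everything else (prompt, method names, the synchronous inference call) is an assumption for illustration:

JUDGE_PROMPT = """You are grading an answer.
Question: {input_query}
Expected: {expected_answer}
Generated: {generated_answer}
Reply with exactly one word: CORRECT or INCORRECT."""

class LlmAsJudgeScorer(BaseScorer):
    def __init__(self, inference_api, judge_model: str):
        self.inference_api = inference_api
        self.judge_model = judge_model

    def score_sample(self, sample: ScorerInputSample) -> float:
        prompt = JUDGE_PROMPT.format(
            input_query=sample.input_query,
            expected_answer=sample.expected_answer,
            generated_answer=sample.generated_answer,
        )
        # Assumed synchronous helper; the real llama-stack inference API is an
        # async chat_completion, simplified here for illustration.
        reply = self.inference_api.completion(model=self.judge_model, prompt=prompt)
        return 0.0 if "INCORRECT" in reply else (1.0 if "CORRECT" in reply else 0.0)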

Eval Dataset Template Schema
# Import paths are approximate for the llama-stack tree at this point in time.
from typing import List, Optional, Union

from llama_models.schema_utils import json_schema_type
from pydantic import BaseModel

from llama_stack.apis.datasets import DatasetSample
from llama_stack.apis.inference import TokenLogProbs

@json_schema_type
class PostprocessedGeneration(BaseModel):
    completion_message: str
    logprobs: Optional[List[TokenLogProbs]] = None

@json_schema_type
class ScorerInputSample(DatasetSample):
    """
    A dataset is required to have the following columns to be used for scoring:
    - generated_answer: str
    - expected_answer: Union[str, List[str]]
    - (optional) input_query: str
    - (optional) generation_output: PostprocessedGeneration
    """

    generated_answer: str
    expected_answer: Union[str, List[str]]
    input_query: Optional[str] = None
    generation_output: Optional[PostprocessedGeneration] = None
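
For illustration, a sample that satisfies this schema (values made up):

sample = ScorerInputSample(
    input_query="What is the capital of France?",
    expected_answer="Paris",
    generated_answer="The capital of France is Paris.",
)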

High-Level Changes

1. New API endpoints added
  • /datasets for registering/deleting datasets

    • /datasets/create --> add a new dataset to the distribution; supports custom file upload / URL / Hugging Face datasets
    • /datasets/get
    • /datasets/delete
    • /datasets/list
  • /evals for running evaluation tasks

    • /evals/run_eval_task --> run full eval including preprocessing -> generation -> postprocessing -> scoring
    • /evals/run_scorer --> run scoring only
2. New data structures added (a sketch follows this list)
  • Registry: for maintaining datasets, scorers, and processors
    • BaseScorer: evaluation methods for scoring
    • BaseDataset: supports custom datasets / Hugging Face datasets
    • BaseGeneratorProcessor: performs preprocessing/postprocessing before/after generation
    • BaseGenerator: performs inference generation
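
A minimal sketch of how these pieces could fit together, assuming simple abstract interfaces; the method and registry names are illustrative, not the final API:

from abc import ABC, abstractmethod
from typing import Dict

class BaseScorer(ABC):
    @abstractmethod
    def score_sample(self, sample: "ScorerInputSample") -> float: ...

class ExactMatchScorer(BaseScorer):
    """Scores 1.0 when the generated answer matches any expected answer."""

    def score_sample(self, sample: "ScorerInputSample") -> float:
        expected = sample.expected_answer
        if isinstance(expected, str):
            expected = [expected]
        return 1.0 if sample.generated_answer in expected else 0.0

# The registry lets eval tasks refer to datasets/scorers/processors by id.
SCORER_REGISTRY: Dict[str, BaseScorer] = {
    "exact_match": ExactMatchScorer(),
}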

(experimental) Eleuther harness

  • integrate with the Eleuther Eval Harness
  • add a custom Eleuther eval harness task for mmlu_pro (e.g., as a recipe)
  • dummy loglikelihood outputs for inference (see the sketch below)
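
A sketch of the "dummy loglikelihood" idea: an lm-eval LM subclass that returns placeholder values so the harness task plumbing can be exercised without a real model. This is based on the lm-eval 0.4.x LM interface; treat the method signatures as approximate.

from lm_eval.api.model import LM

class DummyLM(LM):
    def loglikelihood(self, requests):
        # One (logprob, is_greedy) pair per request; the constants are placeholders.
        return [(-1.0, False) for _ in requests]

    def loglikelihood_rolling(self, requests):
        return [-1.0 for _ in requests]

    def generate_until(self, requests):
        return ["dummy output" for _ in requests]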

Not in this PR:
  🚧 jobs API for background eval job scheduling
  🚧 batch inference
  🚧 persist intermediate datasets during run_eval_task
