simple-evals

How do we run this code?

atahanuz opened this issue 1 year ago · 3 comments

Yes, how do we run this code to evaluate a language model?

atahanuz · Nov 05 '24 12:11

I'm also wondering how to run the evaluations on other models like llama-3. Can anyone clarify the process of evaluating models other than GPT & Claude?

I think we first need to implement a sampler ourselves, then add the model we want to evaluate, together with its sampler, to the models dict in simple_evals.py. But I'm not sure whether this is the correct way to evaluate local models.
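For what it's worth, here is a minimal sketch of such a sampler for a local, OpenAI-compatible endpoint (vLLM, Ollama, llama.cpp server, and similar). LocalChatSampler, the base_url, and the registration key are names I made up; it also assumes the SamplerBase interface in types.py still takes a message list and returns the completion text, which may differ in newer checkouts:

```python
from openai import OpenAI

from .types import MessageList, SamplerBase  # adjust the import path to your checkout


class LocalChatSampler(SamplerBase):
    """Hypothetical sampler for any OpenAI-compatible endpoint."""

    def __init__(
        self,
        model: str = "llama-3",
        base_url: str = "http://localhost:8000/v1",  # assumed local server
    ):
        self.model = model
        # Most local servers ignore the key, but the client requires one.
        self.client = OpenAI(base_url=base_url, api_key="not-needed")

    def _pack_message(self, role: str, content: str):
        return {"role": role, "content": content}

    def __call__(self, message_list: MessageList) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=message_list,
            temperature=0.0,
        )
        return response.choices[0].message.content
```

Then it would be registered in the models dict in simple_evals.py, e.g. "llama-3-local": LocalChatSampler(model="llama-3").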

hank0316 · Nov 21 '24 03:11

In this code, GPT and Claude serve as grader (scoring) models, not as the models under evaluation; what the repo publishes is the evaluation data and the scoring methods. To evaluate LLaMA-3, the recommended steps are (a sketch follows the list):

  1. Query LLaMA-3 on the SimpleQA dataset to collect its answers.
  2. Score those answers with SimpleQA's evaluation method.
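Concretely, the two steps might look like the sketch below. The constructor arguments and result fields reflect my reading of simpleqa_eval.py and could differ in your checkout; LocalChatSampler is the hypothetical sampler from the comment above:

```python
from sampler.chat_completion_sampler import ChatCompletionSampler
from simpleqa_eval import SimpleQAEval

# Model under test: LLaMA-3 behind an OpenAI-compatible endpoint
# (LocalChatSampler is the hypothetical sampler sketched earlier).
llama_sampler = LocalChatSampler(model="llama-3")

# Grader: SimpleQA scores free-form answers with a separate grader model;
# the repo uses a GPT sampler for this by default.
grader = ChatCompletionSampler(model="gpt-4o")

# num_examples is optional; omit it to run the full question set.
simpleqa = SimpleQAEval(grader_model=grader, num_examples=100)
result = simpleqa(llama_sampler)  # step 1: query LLaMA-3; step 2: grade
print(result.score, result.metrics)
```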

zhejunliux · Nov 26 '24 08:11

Here is a fork I made where the evals can be run against any OpenAI-compatible LLM endpoint: https://github.com/ECNU3D/agentic-simple-evals/blob/main/simple_evals.py#L185
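As a lighter-weight alternative to a fork: if your checkout's ChatCompletionSampler constructs the openai client without an explicit base_url (an assumption about its internals, not something the repo documents), the client's standard environment variables may be enough to point it at a local server:

```python
import os

# The openai v1 client reads these when it is constructed, so set them
# before any sampler is created to redirect all requests.
os.environ["OPENAI_BASE_URL"] = "http://localhost:8000/v1"  # assumed local server
os.environ["OPENAI_API_KEY"] = "not-needed"  # most local servers ignore it
```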

ECNU3D · Jun 05 '25 01:06