How do we run this code?
Yes, how do we run this code to evaluate a language model?
I'm also wondering how to run the evaluations on other models like Llama-3. Can anyone clarify the process for evaluating models other than GPT & Claude?
I think we first need to implement a sampler ourselves, then add the model we want to evaluate, together with that sampler, to the models dict in simple_evals.py. I'm not sure whether this is the correct way to evaluate local models, though.
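In case it helps, here is a minimal sketch of such a sampler, assuming the local model is served behind an OpenAI-compatible endpoint (for example vLLM serving Llama-3). The class name, constructor parameters, and the `__call__(message_list) -> str` interface are my assumptions based on how the bundled samplers look; I haven't checked it against SamplerBase, so treat it as a starting point rather than a drop-in file.

```python
from openai import OpenAI


class LocalChatSampler:
    """Samples chat completions from any OpenAI-compatible endpoint."""

    def __init__(self, model: str, base_url: str, api_key: str = "EMPTY",
                 temperature: float = 0.0, max_tokens: int = 1024):
        # Point the OpenAI client at the local server instead of api.openai.com.
        self.client = OpenAI(base_url=base_url, api_key=api_key)
        self.model = model
        self.temperature = temperature
        self.max_tokens = max_tokens

    def _pack_message(self, role: str, content: str) -> dict:
        # Helper in the style of the samplers shipped with this repo (assumed).
        return {"role": role, "content": content}

    def __call__(self, message_list: list[dict]) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=message_list,
            temperature=self.temperature,
            max_tokens=self.max_tokens,
        )
        return response.choices[0].message.content
```

You would then register it in the models dict in simple_evals.py, e.g. models["llama-3-8b-instruct"] = LocalChatSampler(model="meta-llama/Meta-Llama-3-8B-Instruct", base_url="http://localhost:8000/v1").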
GPT and Claude act here as scoring (grader) models, not as the models under evaluation. What this repo publishes is the evaluation data and the scoring methods. To evaluate LLaMA-3, the recommended steps are as follows (a rough sketch in code comes after the list):
- Use the SimpleQA dataset to query LLaMA-3 for results.
- Score the results using the evaluation method of SimpleQA.
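Here is a hedged sketch of those two steps, assuming the evals follow the pattern used in simple_evals.py (an eval object is constructed with a grader sampler and then called on the sampler under test). The module paths, constructor arguments, and the LocalChatSampler import are assumptions on my part, so check the source for the exact signatures:

```python
# Sketch only: module paths, constructor arguments, and the hypothetical
# LocalChatSampler (an OpenAI-compatible sampler like the one sketched in
# the comment above) are assumptions, not the repo's exact API.
from simpleqa_eval import SimpleQAEval
from sampler.chat_completion_sampler import ChatCompletionSampler
from my_samplers import LocalChatSampler  # hypothetical module holding the custom sampler

# Step 1: the model under test -- LLaMA-3 behind an OpenAI-compatible endpoint.
llama3 = LocalChatSampler(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    base_url="http://localhost:8000/v1",
)

# Step 2: SimpleQA grades free-form answers with a separate grader model,
# which is where GPT (or Claude) comes in.
grader = ChatCompletionSampler(model="gpt-4o")

simpleqa = SimpleQAEval(grader_model=grader, num_examples=100)
result = simpleqa(llama3)  # query LLaMA-3 on SimpleQA, then grade its answers
print(result.score, result.metrics)
```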
Here is an example from my fork, which can be run against any OpenAI-compatible LLM endpoint: https://github.com/ECNU3D/agentic-simple-evals/blob/main/simple_evals.py#L185
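More generally, if the samplers construct the default OpenAI() client, an OpenAI-compatible local server can be targeted through the environment variables the official openai Python package reads; whether a given setup picks these up before the client is created is an assumption to verify:

```python
import os

# Set these before any OpenAI() client is constructed so the default client
# talks to the local server (vLLM, Ollama, etc.) instead of api.openai.com.
os.environ["OPENAI_BASE_URL"] = "http://localhost:8000/v1"  # your endpoint
os.environ["OPENAI_API_KEY"] = "EMPTY"  # many local servers ignore the key
```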