How do we run this code?
Yes, how do we run this code to evaluate a language model?
I'm also wondering how to run the evaluations on other models like Llama-3. Can anyone clarify the process for evaluating models other than GPT & Claude?
I think we first need to implement a sampler ourselves, then add the model we want to evaluate, together with that sampler, to the models dict in simple_evals.py. I'm not sure whether this is the correct way to evaluate local models, though.
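In case it helps, here is a minimal sketch of such a sampler, assuming the local model is served behind an OpenAI-compatible endpoint (for example vLLM serving Llama-3). The class name, constructor parameters, and the `__call__(message_list) -> str` interface are my assumptions based on how the bundled samplers look; I haven't checked it against SamplerBase, so treat it as a starting point rather than a drop-in file.

```python
from openai import OpenAI


class LocalChatSampler:
    """Samples chat completions from any OpenAI-compatible endpoint."""

    def __init__(self, model: str, base_url: str, api_key: str = "EMPTY",
                 temperature: float = 0.0, max_tokens: int = 1024):
        # Point the OpenAI client at the local server instead of api.openai.com.
        self.client = OpenAI(base_url=base_url, api_key=api_key)
        self.model = model
        self.temperature = temperature
        self.max_tokens = max_tokens

    def _pack_message(self, role: str, content: str) -> dict:
        # Helper in the style of the samplers shipped with this repo (assumed).
        return {"role": role, "content": content}

    def __call__(self, message_list: list[dict]) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=message_list,
            temperature=self.temperature,
            max_tokens=self.max_tokens,
        )
        return response.choices[0].message.content
```

You would then register it in the models dict in simple_evals.py, e.g. models["llama-3-8b-instruct"] = LocalChatSampler(model="meta-llama/Meta-Llama-3-8B-Instruct", base_url="http://localhost:8000/v1").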
GPT and Claude act here as scoring (grader) models, not as the models under evaluation. What this repo publishes is the evaluation data and the scoring methods. To evaluate LLaMA-3, the recommended steps are as follows (a rough sketch in code comes after the list):
- Use the SimpleQA dataset to query LLaMA-3 for results.
- Score the results using the evaluation method of SimpleQA.
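Here is a hedged sketch of those two steps, assuming the evals follow the pattern used in simple_evals.py (an eval object is constructed with a grader sampler and then called on the sampler under test). The module paths, constructor arguments, and the LocalChatSampler import are assumptions on my part, so check the source for the exact signatures:

```python
# Sketch only: module paths, constructor arguments, and the hypothetical
# LocalChatSampler (an OpenAI-compatible sampler like the one sketched in
# the comment above) are assumptions, not the repo's exact API.
from simpleqa_eval import SimpleQAEval
from sampler.chat_completion_sampler import ChatCompletionSampler
from my_samplers import LocalChatSampler  # hypothetical module holding the custom sampler

# Step 1: the model under test -- LLaMA-3 behind an OpenAI-compatible endpoint.
llama3 = LocalChatSampler(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    base_url="http://localhost:8000/v1",
)

# Step 2: SimpleQA grades free-form answers with a separate grader model,
# which is where GPT (or Claude) comes in.
grader = ChatCompletionSampler(model="gpt-4o")

simpleqa = SimpleQAEval(grader_model=grader, num_examples=100)
result = simpleqa(llama3)  # query LLaMA-3 on SimpleQA, then grade its answers
print(result.score, result.metrics)
```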
Here is an example from my fork, which can be run against any OpenAI-compatible LLM endpoint: https://github.com/ECNU3D/agentic-simple-evals/blob/main/simple_evals.py#L185
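More generally, if the samplers construct the default OpenAI() client, an OpenAI-compatible local server can be targeted through the environment variables the official openai Python package reads; whether a given setup picks these up before the client is created is an assumption to verify:

```python
import os

# Set these before any OpenAI() client is constructed so the default client
# talks to the local server (vLLM, Ollama, etc.) instead of api.openai.com.
os.environ["OPENAI_BASE_URL"] = "http://localhost:8000/v1"  # your endpoint
os.environ["OPENAI_API_KEY"] = "EMPTY"  # many local servers ignore the key
```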