llama
Share your evaluation results
We evaluate Llama on 100 examples from the SQuAD dataset using the Open-evals framework, which extends OpenAI's Evals to other language models. We take the sentence immediately following the prompt as Llama's output and use "include" accuracy as the metric to measure its performance.
For a model completion `a` and a reference list of correct answers `B`, "include" is defined as: `any([(a in b) for b in B])`
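In plain Python, that criterion amounts to the following (the function name is mine, for illustration):

```python
def includes_match(completion: str, references: list[str]) -> bool:
    """Return True if the model completion appears inside any reference answer."""
    return any(completion in ref for ref in references)

# The completion counts as correct if it is a substring of some reference.
print(includes_match("Denver Broncos", ["The Denver Broncos won it"]))  # True
print(includes_match("Panthers", ["The Denver Broncos won it"]))        # False
```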
| model | squad(100) |
|---|---|
| alpaca-lora-7b | 0.88 |
| llama-7b | 0.63 |
| gpt-3.5-turbo | 0.9 |
| text-davinci-003 | 0.87 |
| text-davinci-002 | 0.66 |
| text-davinci-001 | 0.58 |
| ada | 0.35 |
Thank you for sharing your results. Could you share your instruction/template as well?
Thank you for asking.
I utilize Open-evals to convert the data into chat format. The format includes a question to be solved based on a given context, followed by a blank space for the response. Here's an example:
Solve the question based on the context.
Context: {data}
Assistant:
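The template above might be filled from a SQuAD record roughly like this. This is a sketch, not the author's code: the helper name and the way `{data}` combines the context and question are assumptions.

```python
def build_prompt(context: str, question: str) -> str:
    # Hypothetical sketch of rendering a SQuAD context/question pair into
    # the template shown above. How {data} packs the two fields together
    # is an assumption.
    data = f"{context}\nQuestion: {question}"
    return (
        "Solve the question based on the context.\n"
        f"Context: {data}\n"
        "Assistant:"
    )

prompt = build_prompt("The Broncos won Super Bowl 50.", "Who won Super Bowl 50?")
```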
I found that the answer often repeats the prompt and itself. To fix this, I apply some postprocessing to extract the correct answer: I take the sentence immediately after the repeated prompt.
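That postprocessing step could be sketched like this (a minimal version under my own assumptions about the repetition pattern, not the author's actual script):

```python
import re

def extract_answer(prompt: str, completion: str) -> str:
    """Drop a repeated prompt prefix from the completion, then keep only
    the first sentence of what remains."""
    text = completion
    if text.startswith(prompt):  # assumed: the echo is an exact prefix
        text = text[len(prompt):]
    text = text.strip()
    # Take the sentence immediately after the repeated prompt.
    first_sentence = re.split(r"(?<=[.!?])\s", text, maxsplit=1)[0]
    return first_sentence.strip()

print(extract_answer("Q: Who won?", "Q: Who won? The Broncos. The Broncos are..."))
```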
This project uses `<Your llama server ip>/prompt` as the default API URL, but you can modify this by editing the `evals/plugins/llama.py` file.
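A request to that endpoint might look roughly like the sketch below. The payload shape and response field are assumptions; check `evals/plugins/llama.py` for the actual format your server expects.

```python
import json
import urllib.request

def build_payload(prompt: str) -> bytes:
    # Assumed request body shape: {"prompt": "..."} as JSON.
    return json.dumps({"prompt": prompt}).encode("utf-8")

def query_llama(prompt: str, api_url: str) -> str:
    # api_url would be "<Your llama server ip>/prompt" or your edited value.
    req = urllib.request.Request(
        api_url,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("response", "")  # assumed response field
```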
Hey @jeff3071, can you share your evaluation script? I am trying to replicate the results but I see a considerable drop in accuracy.