pyllama
Share your evaluation results
We evaluate LLaMA on 100 examples of the SQuAD dataset with the Open-evals framework, which extends OpenAI's Evals to different language models. We take the sentence immediately following the prompt as LLaMA's output and use `include` accuracy as the metric to measure its performance.
For a model completion `a` and a reference list of correct answers `B`, `include` is defined as: `any([(a in b) for b in B])`.
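For reference, a minimal sketch of this include-accuracy evaluation over SQuAD is shown below. It assumes the Hugging Face `datasets` package and a hypothetical `get_completion` function that wraps the model call; this is not the actual Open-evals code, just an illustration of the metric described above.

```python
from datasets import load_dataset

def includes(completion: str, references: list[str]) -> bool:
    # The metric as defined above: the completion counts as correct
    # if it appears inside any of the reference answers.
    return any(completion in ref for ref in references)

def evaluate_squad(get_completion, n_examples: int = 100) -> float:
    # Load the first n validation examples of SQuAD.
    squad = load_dataset("squad", split=f"validation[:{n_examples}]")
    correct = 0
    for example in squad:
        # Prompt format is an assumption for illustration only.
        prompt = f"{example['context']}\nQuestion: {example['question']}\nAnswer:"
        completion = get_completion(prompt).strip()
        if includes(completion, example["answers"]["text"]):
            correct += 1
    return correct / n_examples
```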
| model | squad(100) |
|---|---|
| alpaca-lora-7b | 0.88 |
| llama-7b | 0.63 |
| gpt-3.5-turbo | 0.9 |
| text-davinci-003 | 0.87 |
| text-davinci-002 | 0.66 |
| text-davinci-001 | 0.58 |
| ada | 0.35 |
Thanks for the results. @jeff3071, can you share your evaluation code?
You can refer to my code at https://github.com/open-evals/evals.
Hi @jeff3071! Here are some results from other OpenAI evals datasets, for your information:
| model | crepe(100) include | born-first(122) match | anagrams(357) match | balance-chemical-equation(100) match | bigrams(200) match |
|---|---|---|---|---|---|
| alpaca-lora-7b | 0.2 | 0.5 | 0 | 0 | 0 |
| gpt-3.5-turbo | 0.45 | 0.64 | 0.29 | 0.31 | 0.18 |
| text-davinci-003 | 0.19 | 0.49 | 0.199 | 0.07 | 0.595 |