llama
Share your evaluation results
We evaluate Llama on 100 examples from the SQuAD dataset using the Open-evals framework, which extends OpenAI's Evals to other language models. We take the sentence immediately following the prompt as Llama's output and use "include" accuracy as the metric to measure its performance.
For a model completion `a` and a reference list of correct answers `B`, "include" is defined as: `any([(a in b) for b in B])`
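In plain Python, that criterion amounts to the following (the function name is mine, for illustration):

```python
def includes_match(completion: str, references: list[str]) -> bool:
    """Return True if the model completion appears inside any reference answer."""
    return any(completion in ref for ref in references)

# The completion counts as correct if it is a substring of some reference.
print(includes_match("Denver Broncos", ["The Denver Broncos won it"]))  # True
print(includes_match("Panthers", ["The Denver Broncos won it"]))        # False
```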
| model | squad(100) |
|---|---|
| alpaca-lora-7b | 0.88 |
| llama-7b | 0.63 |
| gpt-3.5-turbo | 0.9 |
| text-davinci-003 | 0.87 |
| text-davinci-002 | 0.66 |
| text-davinci-001 | 0.58 |
| ada | 0.35 |
Thank you for sharing your results. Could you share your instruction/template as well?
Thank you for asking.
I utilize Open-evals to convert the data into chat format. The format includes a question to be solved based on a given context, followed by a blank space for the response. Here's an example:
Solve the question based on the context.
Context: {data}
Assistant:
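The template above might be filled from a SQuAD record roughly like this. This is a sketch, not the author's code: the helper name and the way `{data}` combines the context and question are assumptions.

```python
def build_prompt(context: str, question: str) -> str:
    # Hypothetical sketch of rendering a SQuAD context/question pair into
    # the template shown above. How {data} packs the two fields together
    # is an assumption.
    data = f"{context}\nQuestion: {question}"
    return (
        "Solve the question based on the context.\n"
        f"Context: {data}\n"
        "Assistant:"
    )

prompt = build_prompt("The Broncos won Super Bowl 50.", "Who won Super Bowl 50?")
```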
I found that the answer often repeats the prompt and itself. To fix this, I apply some postprocessing to extract the correct answer: I take the sentence immediately after the repeated prompt.
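That postprocessing step could be sketched like this (a minimal version under my own assumptions about the repetition pattern, not the author's actual script):

```python
import re

def extract_answer(prompt: str, completion: str) -> str:
    """Drop a repeated prompt prefix from the completion, then keep only
    the first sentence of what remains."""
    text = completion
    if text.startswith(prompt):  # assumed: the echo is an exact prefix
        text = text[len(prompt):]
    text = text.strip()
    # Take the sentence immediately after the repeated prompt.
    first_sentence = re.split(r"(?<=[.!?])\s", text, maxsplit=1)[0]
    return first_sentence.strip()

print(extract_answer("Q: Who won?", "Q: Who won? The Broncos. The Broncos are..."))
```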
This project uses `<Your llama server ip>/prompt` as the default API URL, but you can modify this by editing the `evals/plugins/llama.py` file.
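A request to that endpoint might look roughly like the sketch below. The payload shape and response field are assumptions; check `evals/plugins/llama.py` for the actual format your server expects.

```python
import json
import urllib.request

def build_payload(prompt: str) -> bytes:
    # Assumed request body shape: {"prompt": "..."} as JSON.
    return json.dumps({"prompt": prompt}).encode("utf-8")

def query_llama(prompt: str, api_url: str) -> str:
    # api_url would be "<Your llama server ip>/prompt" or your edited value.
    req = urllib.request.Request(
        api_url,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("response", "")  # assumed response field
```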
Hey @jeff3071, can you share your evaluation script? I am trying to replicate the results but I see a considerable drop in accuracy.