
Quantitative measurement of model perplexity for different models and model quantization modes

Open noughtmare opened this issue 1 year ago • 49 comments

llama.cpp seems to give bad results compared to Facebook's implementation.

Here's an example simple reading comprehension prompt:

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book

LLaMA 7B with Facebook's implementation yields:

Seed 1:

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book with pictures." Asked by lone wolf 1788 days ago.

Seed 2 (to show that the above is not just a fluke):

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book with pictures." Question: "Tom, Mark, and Paul bought books: two with pictures and

While llama.cpp without quantization (so still float16) generates (with --seed 0 -t 8):

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book that is neither as good nor bad." This solution breaks down the problem into its simple parts; then using those components we can see what each component means by itself, in order to solve this logic puzzle. 1) Tom and Mark had different kinds of books...and so did Paul! (Therefore one out three were without pictures). ... [end of text]

It even has a grammatical error at the end: "one out [of] three"

As you can see, the quality of 7B is higher with Facebook's implementation. So I think there may still be bugs in your implementation, or the default parameters could be improved.

noughtmare avatar Mar 14 '23 12:03 noughtmare

With quantization the result is also bad:

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book WITHOUT PICTURES." This is just an example question; I can't figure out how to post actual questions that are being asked in my classroom! So, here we go . . . We recently had our first day back after Thanksgiving break. One of the things

noughtmare avatar Mar 14 '23 12:03 noughtmare

You might not be comparing apples to apples. e.g. are the --top_p and other parameters identical between implementations?

gjmulder avatar Mar 14 '23 13:03 gjmulder

I'm using the default settings, so for the Python code it is:

    temperature: float = 0.8,
    top_p: float = 0.95,

And for llama.cpp:

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000

So I think only the repeat penalty and top_k could be different?
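For reference, here is a rough, illustrative sketch of what these sampling knobs do. This is not the actual code of either implementation; the function name and the NumPy usage are just for the example:

import numpy as np

def sample_next_token(logits, temp=0.8, top_k=40, top_p=0.95):
    # temperature-scaled softmax over the vocabulary
    scaled = logits / temp
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    # top-k: keep only the k most likely tokens
    order = np.argsort(probs)[::-1][:top_k]
    # top-p: keep the smallest prefix whose cumulative mass reaches top_p
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]
    p = probs[keep] / probs[keep].sum()
    return np.random.choice(keep, p=p)

The repeat penalty is applied to the logits before a step like this, so a different repeat_last_n or repeat_penalty can also change the output.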

noughtmare avatar Mar 14 '23 13:03 noughtmare

If I disable the repeat penalty (I assume --repeat_penalty 0 does that), then I still get low-quality results:

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book about cars (p1). He chose to read and look "up Hepa", with only p3 one "out did it one time with his different". The time look only only What look and time about a a it about Answer and Mark What?out his one pout the car it. Tom' his bought

repeat_penalty 1 gives me a more sensible but still wrong result:

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book without pictures. Paul had a different kind of book than Tom and Mark." Question: "Jeffrey, Jennifer, and Jackson bought books: three with pictures and one without. Jackson and Jennifer had different kinds of books. What kind did Jeffrey buy?" Answer: "Jeffrey bought a book

(These results are from the quantized 7B model with seed 0 and default parameters except for the repeat penalty)

noughtmare avatar Mar 14 '23 13:03 noughtmare

top_k looks to be currently broken, as I recently reported in issue #56. I just now realised that due to #95 identical seeds across implementations are unlikely to produce identical results as per @ggerganov's correction to my comment in that issue.

It does then look like llama.cpp is of lower quality. You've tried other prompts and got similar results?

gjmulder avatar Mar 14 '23 13:03 gjmulder

I haven't tested the Python implementation extensively, because Facebook's implementation takes a very long time to run on my CPU. But I generally feel that running 7B and even 13B with llama.cpp gives results that are below the quality that Facebook has claimed.

noughtmare avatar Mar 14 '23 14:03 noughtmare

Try the following parameters; they give me good-quality output:

--temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --repeat_penalty 1.17647

Also, repeat_penalty = 1.0 means disable. Maybe it's not named as well as it should be 😇

beiller avatar Mar 14 '23 14:03 beiller

If 1 means disable, what's the point of values higher than 1? Also, it's good to let it repeat itself a little; sometimes that makes sense in conversation, but a tighter setting lets it break loops before they begin.

Urammar avatar Mar 14 '23 14:03 Urammar

Try the following parameters

Still gives me a wrong result with the quantized model:

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book with no pictures." Answer to Question 1739: "The three students were going on an outing. They needed shoes for the trip. Each student owned a pair of shoes that was not his own. Which student wore tennis shoes? (Hint: The answer is in the question.)

With the fp16 model it is also wrong:

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book that was not in the picture." "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book that was not in the picture." [end of text]

I think the problem is more fundamental than just a change of the parameters.

noughtmare avatar Mar 14 '23 14:03 noughtmare

I haven't tested the Python implementation extensively, because Facebook's implementation takes a very long time to run on my CPU. But I generally feel that running 7B and even 13B with llama.cpp gives results that are below the quality that Facebook has claimed.

It may be simply a case of the project management triangle, i.e. choose any two of:

  1. Performance
  2. Quality
  3. Self-hosting

gjmulder avatar Mar 14 '23 14:03 gjmulder

That might be so, but I don't see an obvious reason why the quality would be lower. Quantization could have been a logical cause, but I think I have shown that even the fp16 model has lower quality.

noughtmare avatar Mar 14 '23 14:03 noughtmare

If it's simply a straight-up C++ implementation then it should be the same, but an install step on the GitHub page states it must be quantized, which means even if you are running it in fp16 it's still been crunched in precision to run better, which naturally means its outputs will slightly differ.

You wouldn't expect a mile-long road at 18.2 degrees to end up at the same place as one rebuilt at 18.0 degrees, right?

As you said just as I was typing this, quantization made its brain just that little bit more crispy, and that clearly affects it slightly. That's probably not solvable.
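To illustrate the "crunched in precision" point, here is a toy round-trip of 4-bit block quantization. This is only a sketch of the general idea, not ggml's exact Q4_0 format, and the function name is made up:

import numpy as np

def fake_q4_roundtrip(weights, block=32):
    # assumes the number of weights is a multiple of the block size
    w = weights.reshape(-1, block)
    # each block shares one float scale
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0
    # round to 16 levels (4 bits); this is where precision is lost
    q = np.clip(np.round(w / scale), -8, 7)
    return (q * scale).reshape(weights.shape)

Every weight ends up slightly off after a round-trip like this, so the generated text can diverge even with identical seeds and sampling settings.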

Urammar avatar Mar 14 '23 14:03 Urammar

but an install step on the GitHub page states it must be quantized,

I don't think that step is required. The model runs fine without the quantization step. And the readme also claims llama.cpp has "Mixed F16/F32 precision". Edit: there's an example of running without quantization here: https://github.com/ggerganov/llama.cpp/issues/2#issuecomment-1464615286

noughtmare avatar Mar 14 '23 15:03 noughtmare

@Urammar values higher than 1 start to penalize the predicted next token if it occurred in the previous N tokens. It will multiply the likelihood by 1/penalty.
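Roughly, the idea looks like this; a minimal sketch, not llama.cpp's actual code, and the function name is made up:

def apply_repeat_penalty(logits, last_tokens, penalty=1.3):
    # any token seen in the recent window becomes less likely:
    # its logit is scaled so its probability drops by roughly a factor of the penalty
    for tok in set(last_tokens):
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits

With penalty = 1.0 both branches leave the logit unchanged, which is why 1.0 effectively means "disabled".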

beiller avatar Mar 14 '23 15:03 beiller

Try like so:

./main -m ./models/13B/ggml-model-q4_0.bin -t 4 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --repeat_penalty 1.17647 -p $'Question: "Question: "There are two ducks in front of a duck, two ducks behind a duck and a duck in the middle. How many ducks are there?" Answer: Three. Two ducks are in front of the last duck; the first duck has two ducks behind; one duck is between the other two.\n\nQuestion: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book'

Question: "Question: "There are two ducks in front of a duck, two ducks behind a duck and a duck in the middle. How many ducks are there?" Answer: Three. Two ducks are in front of the last duck; the first duck has two ducks behind; one duck is between the other two.

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book that was not illustrated."

EDIT: Love how my brain failed at interpreting this; let me try a larger model.

beiller avatar Mar 14 '23 15:03 beiller

For me it consistently answers incorrectly:

Question: "Question: "There are two ducks in front of a duck, two ducks behind a duck and a duck in the middle. How many ducks are there?" Answer: Three. Two ducks are in front of the last duck; the first duck has two ducks behind; one duck is between the other two.

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book without pictures."

beiller avatar Mar 14 '23 15:03 beiller

Haha, that question about ducks is also interesting. Using this prompt:

Question: "There are two ducks in front of a duck, two ducks behind a duck and a duck in the middle. How many ducks are there?" Answer: "There are

The Python implementation outputs a plausible answer:

Question: "There are two ducks in front of a duck, two ducks behind a duck and a duck in the middle. How many ducks are there?" Answer: "There are seven ducks." The answer is correct, but it is not obvious. Try to explain your answer

But llama.cpp 7B FP16 outputs garbage:

Question: "There are two ducks in front of a duck, two ducks behind a duck and a duck in the middle. How many ducks are there?" Answer: "There are three duc... I'm sorry for your loss, but I think it is fair to say that you have moved on from this traumatic event by now. [end of text]

noughtmare avatar Mar 14 '23 15:03 noughtmare

I get consistently non-garbage output. Can you try using the settings I had above? I am on a different branch; I wonder if that has anything to do with it.

beiller avatar Mar 14 '23 15:03 beiller

I mean, it even explains itself:

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book that has no pictures." The correct answer is A because the question says they all bought books but it doesn't say which ones so B isn't right becuase you don't know if tom or mark got the same thing as paul. C can be eliminated becuase there are only three choices left to choose from. D can also be eliminated becuase again you have to chose between 3 things not four. So your left with A and E becuase those are the only two options left. [end of text]

I'm not sure if this is the best way to objectively tell the quality of the output :)

EDIT cmd line params:

./main -m ./models/7B/ggml-model-q4_0.bin -t 4 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --repeat_penalty 1.17647 -n 1024 -p $'Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book'

beiller avatar Mar 14 '23 15:03 beiller

Actually, after playing around a bit with the quantized model, I now believe that the problem is only in running the FP16 model. The quantized model seems to work much better for me.

noughtmare avatar Mar 14 '23 19:03 noughtmare

Thanks for sharing your parameters, guys; I definitely get better results than with the default ones.

I ran the same prompt with 5 different models: 7B/Q4, 7B/F16, 13B/Q4, 13B/F16, 30B/Q4:

./main -m ./models/<model>.bin -t 8 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --repeat_penalty 1.17647 -n 1024 -p 'Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?"'

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 

See the results below.

I ran each one multiple times; the results of the individual runs with a given model are comparable in quality. I don't see a major quality difference between Q4 and F16. Interestingly, 13B gave me the weirdest results. It was also always tempted to return some kind of LaTeX code; I could observe this with other prompts as well.

Results

7B / Q4

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?"
Answer: "Paul bought the book that didn't have a picture."

7B / F16

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?"
Answer: A book that was not a picture book. [end of text]

13B / Q4

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?"

I have no idea how to solve this problem. I've tried it a few ways but can't seem to get the right answer.

Comment: You need to add more information about what you know or don't know about these people/books etc...

Answer: \begin{blockquote}

\strong{Hint:} There are three possibilities for each book type (with picture and without).  So there are $3^2$ possible combinations for all six books combined.  The number of permutations is $\frac{(6!)}{(3!)^2}=15$.
\end{blockquote} [end of text]

13B / F16

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?"

\begin{blockquote}

Tom, Mark, and Paul bought books: two with pictures and one without.  Tom and Mark had different kinds of books.  What kind did Paul buy?
\end{blockquote}

I'm not sure how to approach this problem. I know that there are three possible answers (pictures or no picture) but the question doesn't specify which is correct so it could be any combination. Any help would be appreciated! Thanks in advance :)

Comment: This isn't a maths question - it belongs on [Puzzling](http://puzzling.stackexchange.com/).

Answer: The answer is that \strong{Paul has the same book as Tom}. If you assume that they all have different types of books then we can deduce from the information given that at least one person must have a book with pictures because otherwise only one type of book exists. So either Tom or Mark buys such a book. But if both do then they cannot have different types of books. Therefore Paul also has a book with pictures. [end of text]

30B / Q4

Question: "Tom, Mark, and Paul bought books: two with pictures and one without. Tom and Mark had different kinds of books. What kind did Paul buy?"
The answer is given as 2, but I don't understand why it isn't 1?
If we know that the book without a picture was purchased by Paul, then wouldn't we also have to assume that he didn't purchase any other books (because if he did, then there would be more than three total)? And if we can make this assumption, doesn't that mean that Tom and Mark must have both purchased books with pictures in them?
I think you are right. The question is poorly worded. It should say something like "Mark and Paul each bought only one book." Then it would make sense. [end of text]

sburnicki avatar Mar 15 '23 09:03 sburnicki

@sburnicki I think it is better to include Answer: "Paul bought a book in your prompt to avoid the cases where the model doesn't give a straight answer, as almost all your results do.

Also, in hindsight I think I should have worded it slightly differently:

Question: "Tom, Mark, and Paul each bought a book. Together they bought two with pictures and one without. Tom and Mark bought different kinds of books. What kind did Paul buy?" Answer: "Paul bought a book

noughtmare avatar Mar 15 '23 10:03 noughtmare

One bug I found is #173: llama.cpp seems to use a different norm method.

hoangmit avatar Mar 15 '23 20:03 hoangmit

So I think the quality of the generations is difficult to evaluate. We need a more quantitative metric. I think we will put more effort into that once we have the Q4_1 quantization ready.

ggerganov avatar Mar 15 '23 20:03 ggerganov

We need a more quantitative metric.

This blog post did a quantitative measurement of quality for comparing different quantization methods, though I don't know how well it corresponds to subjective quality. Code is here, though I'm not sure if it includes the quality measurement code. There is also this project, which does include measurement code.

I expect one of these could serve as a starting point for building an objective metric for quality.
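The metric itself is simple to state: perplexity is the exponential of the average negative log-likelihood the model assigns to the ground-truth tokens of a test text, so lower is better. A minimal sketch, assuming token_logprobs holds the model's natural-log probability of each reference token:

import math

def perplexity(token_logprobs):
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# e.g. perplexity([-2.1, -0.3, -1.7]) is about 3.9

The harder part is the evaluation loop that collects those log-probabilities consistently from each model and quantization mode.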

bakkot avatar Mar 15 '23 23:03 bakkot

Here's more evidence for the low quality. Prompt:

Here's a Python program that computes fibonacci numbers: def fib(n):   

(The last line has two spaces)

Python LLaMA 7B

seed 1:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   if n == 1:
      return 1
   elif n == 2:
      return 1
   else:
      return fib(n - 1) + fib

seed 2:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   if n==0:
      return 0
   elif n==1:
      return 1
   else:
      return fib(n-1) + fib(n-

seed 3:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   if n == 0 or n == 1:
       return n
   else:
       return fib(n-1) + fib(n-2)
print(fib

llama.cpp 7B F16

seed 1:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   x=0 #initialization
   for i in range(1,n+1):
       y=x+(x*i) % n  #modulo operation with remainder
       x=y
   return x    #return the answer.

print("The Fibonacci number is

seed 2:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   if n <= 1:
      return 1
   else:
      return fib(n-1) + fib(n-2)

for i in xrange(200,100000):
   print "Fib(%i): %i"% (

(This one is pretty good)

seed 3:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   if n == 0 or n == 1:
      return 0, None
   else:
      # we use 'or' instead of 'and' to be sure the loop doesn't exit too soon
      for i in range (2, n+1):
         a,b

llama.cpp 7B Q4_0

seed 1:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   result = 1
   for i in range(2, n+1):
       if i%2 == 0:
          result += i/2
       else:
          result += i*fib(i-1)
    return result
# for example, the answer to fib

seed 2:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   if n < 2: return 1
   elif n == 2: return 0
   else: return fib(n-1) + fib(n-2)
def fibonacci(n):
   if n < 3: return 0
   elif n < 6

(This one would be correct if not for that elif n == 2: return 0)

seed 3:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   if n == 0 or n == 1: return 1
return n - (fib(n-1) + fib(n-2))
I was reading through the code and wondering what the point of fib(0) was. Is it just to prevent an index out of bounds exception when

Results:

  • Python: 3/3
  • F16: 1/3
  • Q4_0: 0/3

noughtmare avatar Mar 16 '23 11:03 noughtmare

I could agree that the quantized model shows low-quality answers.

write a function to generate fibonacci numbers as a list

def fib(n):  # n is the number of terms in the sequence.
    if (n == 0) or (n == 1): return [0, 1]
    else: return fib(n-1)+fib(n-2)

how does it work?

The program will take an input sentence and convert it into its corresponding Morse code signal using a lookup table. It then outputs that Morse code signal along with the original text. The output can be displayed on screen or printed out for later reference.

i didn't ask about morse code, i asked about fibonacci function

Fibonacci functions are mathematical sequences defined by the following recurrence relation:

a_n = a_{n - 1} + a_{n - 2}, where "a_" stands for Fibonacci sequence at index ___.

For example, given a=0, b=1, we have:

a_0 = 0;
a_1 = 1;
a_2 = 1;
a_3 = 2;
a_4 = 3;

continue

continue

thepowerfuldeez avatar Mar 19 '23 18:03 thepowerfuldeez

@noughtmare and others

Thank you very much for these investigations. This is very useful, as I don't have the Python codebase set up to run on my M1.

It is very possible that there is some subtle bug in llama.cpp. The other explanation is that the sampling strategy is somehow different.

I think the ggml FP16 should produce similar results to the original code. If that is not the case, we have to figure out what the issue is.

I tried running the FP16 model with the following parameters, and it seems to usually perform well on the fibonacci task:

./main -m models/7B/ggml-model-f16.bin -p "Here's a Python program that computes fibonacci numbers:
def fib(n):
  " --top_k 10000 --temp 0.96 --repeat_penalty 1

Can you confirm that is the case for you as well?

ggerganov avatar Mar 19 '23 19:03 ggerganov

Can we rule out the tokenizer? I can't test at the moment, but there is another issue claiming the tokenization output differs between implementations.

beiller avatar Mar 19 '23 20:03 beiller

Can you confirm that is the case for you as well?

I get these. --seed 1:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   res=0;
   for i in range(2,n+2):
       res=res+i;
   return res

--seed 2:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   if n==0:
      return 0
   if n==1:
      return 1
   return fib(n-1) + fib(n-2)

--seed 3:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   if n == 0:
      return 0
   elif n == 1:
      return 1
   else:
      return fib(n-1) + fib(n-2)

--seed 4:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   result = [0, 1]
   for i in range(1, n):
      result[0] = result[1] + result[0]
      result[1] = result[0]
   return result

--seed 5:

Here's a Python program that computes fibonacci numbers:

def fib(n):
   if n == 0:
      print 'fib(0) = 0'
      return 0
   else:
   print 'fib(1) = 1'
   print 'fib(2) = 1'
   print 'fib(3) = 2'
   print 'fib(4) = 3'
   n = n-1
   return fib(n-1)+fib(n-2)

So only 2/5 are correct programs. I've also run the first 20 seeds, of which it got 10/20 (50%) correct.

I've just rerun this prompt on the Python implementation, and it got 14/20 seeds (70%) correct.

noughtmare avatar Mar 20 '23 08:03 noughtmare