Add support for batch size to `--perplexity`
[Draft]
I'm seeing a significant difference in output logits when running with batch_size != ctx_size. I've instrumented the code to dump the logits so I can compare them across batch_size=8 and batch_size=512. The logits match for the first token, but after that they diverge entirely.
Commands:
./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -b 512 >b_512.out
./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -b 8 >b_8.out
Diff:
First logit of the second token (all remaining logits differ as well):
-1.605274 (b_8.out)
-1.507442 (b_512.out)
Also, the perplexity for the -b 8 version is measurably worse (although not disastrous), so model quality appears to be impacted by the difference, e.g. 4.6512 for -b 8 vs 4.5970 for -b 512.
OK, very interesting result. From https://github.com/ggerganov/llama.cpp/discussions/406#discussioncomment-5397084, there was a delta between 10 and 32 threads, so I tried rerunning my experiment with 1 thread. The results are now perfectly consistent across the different batch sizes! Interestingly, the 1-thread results both match the batch size 512 result (with 32 threads).
I confirm the observation - looking into this
Edit: The source of the variation looks to be in the "self-attention" section of llama_eval(). Could be some numerical instability.
Edit2: Pretty sure I've pin-pointed the cause:
https://github.com/ggerganov/llama.cpp/blob/9ea43d4d9124d6a05ba1027dd05d65c5ffdfeae7/llama.cpp#L737
This matrix multiplication z = x*y goes through the branch where x (i.e. V_trans) is not contiguous in memory, since we transposed it on the previous line via the ggml_permute() call. The simple fix is to make a copy into a contiguous buffer, but I want to see if I can find the instability in this branch and try to fix it.
@glinscott Here is a quick fix to not block you:
https://github.com/ggerganov/llama.cpp/pull/439
Seems like the "transposed X" branch is more efficient, but not numerically stable. I will try to see if I can resolve it, and if not, I will simply remove it altogether from ggml.
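For reference, the workaround amounts to making the permuted V tensor contiguous before the mul_mat. A rough sketch of the idea (not necessarily the exact code in #439; names like `ctx0`, `V`, and `KQ_soft_max` only approximate the ones used in `llama_eval()`):

```c
// Sketch of the contiguous-copy workaround: copy the permuted (non-contiguous) V
// into a freshly allocated contiguous FP32 tensor, so ggml_mul_mat() never takes
// the non-contiguous "transposed X" branch.
struct ggml_tensor * V_trans =
    ggml_cpy(ctx0,
             ggml_permute(ctx0, V, 1, 2, 0, 3),       // the original transposition
             ggml_new_tensor_3d(ctx0, GGML_TYPE_F32,  // contiguous destination buffer
                                n_past + N, n_embd/n_head, n_head));

struct ggml_tensor * KQV = ggml_mul_mat(ctx0, V_trans, KQ_soft_max);
```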
Kind of colliding with #438. Also, the batch size in your case has the opposite meaning from what it normally does (in non-perplexity mode), which makes it quite deceptive.
@ggerganov awesome! thank you, very nice find.
@Green-Sky Ah, I must admit I don't quite understand the main-mode batch size parameter. I thought it performs an evaluation after every `batch` tokens, which is what this PR intends to do. Also, I thought it would save a significant amount of RAM, but in practice that doesn't seem to be the case, so I'm not sure it's actually useful.
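For what it's worth, what I have in mind is roughly the sketch below (assuming the `llama_eval()` API of the time; `n_ctx`, `n_batch`, `tokens`, `start`, and `n_threads` are illustrative names):

```cpp
// Sketch (inside the perplexity loop): evaluate one context-sized chunk of the
// test set in n_batch-sized slices instead of a single llama_eval() call.
// Requires <algorithm> for std::min.
for (int j = 0; j < n_ctx; j += n_batch) {
    const int n_eval = std::min(n_batch, n_ctx - j);
    if (llama_eval(ctx, tokens.data() + start + j, n_eval, j, n_threads) != 0) {
        fprintf(stderr, "failed to eval\n");
        return;
    }
}
```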
@ggerganov ./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw runs out of buffer space with 4870e455b3653f7d7769fa5772b2c90ffad088df. If I go back to 483bab2e3d4a868fe679d8bb32827d2a4df214dc it works well.
Ok, so after 483bab2e3d4a868fe679d8bb32827d2a4df214dc, I see results are consistent at a given batch_size, but different across batch_sizes.
E.g. I tested $ ./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -b 8 with 1, 8, and 32 threads, and always got [1]4.6257.
Then I tested $ ./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -b 512 with 1, 8, and 32 threads, and always got [1]4.5690.
Then I did a few more:
- batch_size=256, threads=32 -> 4.5690
- batch_size=64, threads=32 -> 4.5690
- batch_size=32, threads=32 -> 4.5690
- batch_size=16, threads=32 -> 4.5903 ** first delta
- batch_size=16, threads=16 -> 4.5903
> Also, I thought it would save a significant amount of RAM, but in practice, that seems to not be the case, so I'm not sure it's actually useful.

It would, if the memory management actually took it into account.
@ggerganov

> ./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw runs out of buffer space with 4870e45. If I go back to 483bab2 it works well.

See https://github.com/ggerganov/llama.cpp/pull/438 for a competing attempt.
@Green-Sky I think I know how to fix the memory issues and reduce token memory usage drastically. Will try to do this later today
@glinscott Are you on a Mac? I think at batch size >= 32 the BLAS Accelerate mul_mat branch is triggered:
https://github.com/ggerganov/llama.cpp/blob/3cd8dde0d1357b7f11bdd25c45d5bf5e97e284a0/ggml.c#L5722-L5729
If you disable BLAS with `make clean && LLAMA_NO_ACCELERATE=1 make`, do you maybe get the same results?
@ggerganov thanks for the suggestion - I'm on an AMD 5950X. I did try building with LLAMA_NO_ACCELERATE=1, but got the same results. It is interesting that the results switch at batch size 16 rather than 32, though.
I'm doing a run to compare batch size 8 vs 512 for the default context size with BLAS on, and if that looks close, this is ready to go. Otherwise, I'd probably default the batch size to min(512, context_size) automatically.
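If it comes to that, the clamp would be a one-liner somewhere in the setup (illustrative only; `params.n_batch`/`params.n_ctx` assumed to be the usual gpt_params fields):

```cpp
// Illustrative only: default the perplexity batch size to min(512, context size).
params.n_batch = std::min(512, params.n_ctx);
```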
Some very interesting results here. I'm building with `LLAMA_OPENBLAS=1 make -j4` currently. I think this means that the batch_size 8 version is not using BLAS, while the larger one is - and it results in a huge perplexity decrease! Will have to try switching to the BLAS version even for smaller sizes and see how it goes.
$ ./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -t 16
perplexity : calculating perplexity over 655 chunks, batch_size=8
18.83 seconds per pass - ETA 3.43 hours
[655]6.6016,
$ ./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -t 16 -b 512
perplexity : calculating perplexity over 655 chunks, batch_size=512
12.96 seconds per pass - ETA 2.36 hours
[655]6.2838
Indeed, hardcoding `ggml_compute_forward_mul_mat_use_blas` to return true results in excellent perplexity, but it's incredibly slow:
perplexity : calculating perplexity over 655 chunks, batch_size=16
847.81 seconds per pass - ETA 154.26 hours
[1]4.3801,
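For anyone wanting to reproduce the forced-BLAS experiment: the hack is just short-circuiting the dispatch check in ggml.c (signature written from memory, so treat it as a sketch; the real function only returns true for sufficiently large, contiguous matrices):

```c
// Experiment only: force every mul_mat through the BLAS path
// (dequantize the quantized operand to FP32 and call sgemm), regardless of size.
static bool ggml_compute_forward_mul_mat_use_blas(
        const struct ggml_tensor * src0,
        const struct ggml_tensor * src1,
              struct ggml_tensor * dst) {
    (void) src0; (void) src1; (void) dst;
    return true;
}
```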
Hi @glinscott, I have tested hardcoding `ggml_compute_forward_mul_mat_use_blas` to return true using Intel MKL:
## Question: What is best in life? ## Jeeves: Biotechnology and Society.2015, v.35(4), p.678-692.
The Human Genome Project (HGP) was
llama_print_timings: load time = 9084.05 ms
llama_print_timings: sample time = 21.13 ms / 40 runs ( 0.53 ms per run)
llama_print_timings: prompt eval time = 16373.49 ms / 16 tokens ( 1023.34 ms per token)
llama_print_timings: eval time = 157398.71 ms / 39 runs ( 4035.86 ms per run)
llama_print_timings: total time = 174428.73 ms
And OpenBLAS:
## Question: What is best in life? ## Jeeves: Biotechnology and Society.2015, v.35(4), p.678-692.
The Human Genome Project (HGP) was
llama_print_timings: load time = 8909.49 ms
llama_print_timings: sample time = 24.11 ms / 40 runs ( 0.60 ms per run)
llama_print_timings: prompt eval time = 16349.56 ms / 16 tokens ( 1021.85 ms per token)
llama_print_timings: eval time = 288016.53 ms / 39 runs ( 7385.04 ms per run)
llama_print_timings: total time = 305045.43 ms
Also, the perplexity is slightly better ([3]5.8269 vs [3]5.8271).
@glinscott
Yes, using BLAS during perplexity computation can be deceiving (I think I noted this somewhere earlier). I think the explanation is the following:
Let's have the matrix multiplication Z = X*Y, where:
- X is 4-bit quantized
- Y is FP32
- Z is FP32
When using BLAS, ggml will dequantize X into FP32 and use BLAS's sgemm to do the matrix multiplication.
When not using BLAS, ggml will quantize Y to 4-bit and use the SIMD routines for 4-bit dot product.
I think the BLAS computation will be more precise, because we lose precision when quantizing Y in the latter case.
Additionally, I am not super confident that the current dot product routines accumulate the floating-point values optimally - I think there might be things to improve here.
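To make the "we lose precision when quantizing Y" point concrete, here is a toy round-trip through a symmetric 4-bit quantizer (a simplified stand-in, not the actual ggml Q4_0 routine):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // A toy block of FP32 activations standing in for Y.
    const float y[8] = {0.013f, -0.207f, 0.991f, 0.042f, -0.655f, 0.300f, -0.091f, 0.500f};

    // Block scale: map the largest magnitude to quantization level 7.
    float amax = 0.0f;
    for (float v : y) amax = std::fmax(amax, std::fabs(v));
    const float d = amax / 7.0f;

    for (float v : y) {
        const int   q  = (int) std::round(v / d); // quantize to an integer level in [-7, 7]
        const float vr = q * d;                   // dequantize
        std::printf("y = % .4f  ->  q = %+d  ->  % .4f  (err % .4f)\n", v, q, vr, vr - v);
    }
    return 0;
}
```

In the BLAS path Y stays in FP32 and only X carries quantization error, which is consistent with the slightly better perplexity reported above.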
It would be nice to get this PR finalized.
Sorry for the delay. I've updated it so the batch size defaults to 512, which is much faster. Ready to go!