Add support for batch size to `--perplexity`
[Draft]
I'm seeing a significant difference in output logits when running with batch_size != ctx_size. I've instrumented the code to dump the logits so I can compare them across batch_size=8 and batch_size=512. The logits match for the first token, but after that they diverge entirely.
Commands:
./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -b 512 >b_512.out
./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -b 8 >b_8.out
Diff:
First logit of the second token (all remaining logits differ as well):
-1.605274 (b_8.out)
-1.507442 (b_512.out)
Also, the perplexity for the -b 8 version is measurably worse (although not disastrous), so model quality appears to be impacted by the difference, e.g. 4.6512 for -b 8 vs 4.5970 for -b 512.
OK, very interesting result. From https://github.com/ggerganov/llama.cpp/discussions/406#discussioncomment-5397084, there was a delta between 10 and 32 threads, so I tried rerunning my experiment with 1 thread. The results are now perfectly consistent across the different batch sizes! Interestingly, the 1-thread results both match the batch size 512 result (with 32 threads).
I confirm the observation - looking into this
Edit: The source of the variation looks to be in the "self-attention" section of llama_eval(). Could be some numerical instability.
Edit2: Pretty sure I've pin-pointed the cause:
https://github.com/ggerganov/llama.cpp/blob/9ea43d4d9124d6a05ba1027dd05d65c5ffdfeae7/llama.cpp#L737
This matrix multiplication z = x*y goes through the branch where x (i.e. V_trans) is not contiguous in memory, since we transposed it on the previous line via the ggml_permute() call. The simple fix is to make a copy into a contiguous buffer, but I want to see if I can find the instability in this branch and try to fix it.
@glinscott Here is a quick fix to not block you:
https://github.com/ggerganov/llama.cpp/pull/439
Seems like the "transposed X" branch is more efficient, but not numerically stable. I will try to see if I can resolve it, and if not, I will simply remove it altogether from ggml.
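For reference, the workaround amounts to making the permuted V tensor contiguous before the mul_mat. A rough sketch of the idea (not necessarily the exact code in #439; names like `ctx0`, `V`, and `KQ_soft_max` only approximate the ones used in `llama_eval()`):

```c
// Sketch of the contiguous-copy workaround: copy the permuted (non-contiguous) V
// into a freshly allocated contiguous FP32 tensor, so ggml_mul_mat() never takes
// the non-contiguous "transposed X" branch.
struct ggml_tensor * V_trans =
    ggml_cpy(ctx0,
             ggml_permute(ctx0, V, 1, 2, 0, 3),       // the original transposition
             ggml_new_tensor_3d(ctx0, GGML_TYPE_F32,  // contiguous destination buffer
                                n_past + N, n_embd/n_head, n_head));

struct ggml_tensor * KQV = ggml_mul_mat(ctx0, V_trans, KQ_soft_max);
```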
Kind of colliding with #438. Also, the batch size in your case has the opposite meaning from what it normally does (in non-perplexity mode), which makes it quite deceptive.
@ggerganov awesome! thank you, very nice find.
@Green-Sky Ah, I must admit I don't quite understand the main-mode batch size parameter. I thought it performs an evaluation after every `batch` tokens, which is what this PR intends to do. Also, I thought it would save a significant amount of RAM, but in practice that doesn't seem to be the case, so I'm not sure it's actually useful.
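For what it's worth, what I have in mind is roughly the sketch below (assuming the `llama_eval()` API of the time; `n_ctx`, `n_batch`, `tokens`, `start`, and `n_threads` are illustrative names):

```cpp
// Sketch (inside the perplexity loop): evaluate one context-sized chunk of the
// test set in n_batch-sized slices instead of a single llama_eval() call.
// Requires <algorithm> for std::min.
for (int j = 0; j < n_ctx; j += n_batch) {
    const int n_eval = std::min(n_batch, n_ctx - j);
    if (llama_eval(ctx, tokens.data() + start + j, n_eval, j, n_threads) != 0) {
        fprintf(stderr, "failed to eval\n");
        return;
    }
}
```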
@ggerganov ./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw runs out of buffer space with 4870e455b3653f7d7769fa5772b2c90ffad088df. If I go back to 483bab2e3d4a868fe679d8bb32827d2a4df214dc it works well.
Ok, so after 483bab2e3d4a868fe679d8bb32827d2a4df214dc, I see results are consistent at a given batch_size, but different across batch_sizes.
E.g. I tested $ ./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -b 8 with 1, 8, and 32 threads, and always got [1]4.6257.
Then I tested $ ./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -b 512 with 1, 8, and 32 threads, and always got [1]4.5690.
Then I did a few more:
- batch_size=256, threads=32 -> 4.5690
- batch_size=64, threads=32 -> 4.5690
- batch_size=32, threads=32 -> 4.5690
- batch_size=16, threads=32 -> 4.5903 ** first delta
- batch_size=16, threads=16 -> 4.5903
> Also, I thought it would save a significant amount of RAM, but in practice, that seems to not be the case, so I'm not sure it's actually useful.

It would, if the memory management actually took it into account.
@ggerganov

> ./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw runs out of buffer space with 4870e45. If I go back to 483bab2 it works well.

See https://github.com/ggerganov/llama.cpp/pull/438 for a competing attempt.
@Green-Sky I think I know how to fix the memory issues and reduce token memory usage drastically. Will try to do this later today
@glinscott Are you on a Mac? I think at batch size >= 32 the BLAS Accelerate mul_mat branch is triggered:
https://github.com/ggerganov/llama.cpp/blob/3cd8dde0d1357b7f11bdd25c45d5bf5e97e284a0/ggml.c#L5722-L5729
If you disable BLAS with `make clean && LLAMA_NO_ACCELERATE=1 make`, do you maybe get the same results?
@ggerganov thanks for the suggestion - I'm on an AMD 5950X. I did try building with LLAMA_NO_ACCELERATE=1, but got the same results. It is interesting that the results switch at batch size 16 rather than 32, though.
I'm doing a run to compare batch size 8 vs 512 for the default context size with BLAS on, and if that looks close, this is ready to go. Otherwise, I'd probably default the batch size to min(512, context_size) automatically.
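If it comes to that, the clamp would be a one-liner somewhere in the setup (illustrative only; `params.n_batch`/`params.n_ctx` assumed to be the usual gpt_params fields):

```cpp
// Illustrative only: default the perplexity batch size to min(512, context size).
params.n_batch = std::min(512, params.n_ctx);
```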
Some very interesting results here. I'm building with `LLAMA_OPENBLAS=1 make -j4` currently. I think this means that the batch_size 8 version is not using BLAS, while the larger one is - and it results in a huge perplexity decrease! Will have to try switching to the BLAS version even for smaller sizes and see how it goes.
$ ./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -t 16
perplexity : calculating perplexity over 655 chunks, batch_size=8
18.83 seconds per pass - ETA 3.43 hours
[655]6.6016,
$ ./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -t 16 -b 512
perplexity : calculating perplexity over 655 chunks, batch_size=512
12.96 seconds per pass - ETA 2.36 hours
[655]6.2838
Indeed, hardcoding `ggml_compute_forward_mul_mat_use_blas` to return true results in excellent perplexity, but it's incredibly slow:
perplexity : calculating perplexity over 655 chunks, batch_size=16
847.81 seconds per pass - ETA 154.26 hours
[1]4.3801,
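For anyone wanting to reproduce the forced-BLAS experiment: the hack is just short-circuiting the dispatch check in ggml.c (signature written from memory, so treat it as a sketch; the real function only returns true for sufficiently large, contiguous matrices):

```c
// Experiment only: force every mul_mat through the BLAS path
// (dequantize the quantized operand to FP32 and call sgemm), regardless of size.
static bool ggml_compute_forward_mul_mat_use_blas(
        const struct ggml_tensor * src0,
        const struct ggml_tensor * src1,
              struct ggml_tensor * dst) {
    (void) src0; (void) src1; (void) dst;
    return true;
}
```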
Hi @glinscott, I have tested hardcoding `ggml_compute_forward_mul_mat_use_blas` to return true using Intel MKL:
## Question: What is best in life? ## Jeeves: Biotechnology and Society.2015, v.35(4), p.678-692.
The Human Genome Project (HGP) was
llama_print_timings: load time = 9084.05 ms
llama_print_timings: sample time = 21.13 ms / 40 runs ( 0.53 ms per run)
llama_print_timings: prompt eval time = 16373.49 ms / 16 tokens ( 1023.34 ms per token)
llama_print_timings: eval time = 157398.71 ms / 39 runs ( 4035.86 ms per run)
llama_print_timings: total time = 174428.73 ms
And OpenBLAS:
## Question: What is best in life? ## Jeeves: Biotechnology and Society.2015, v.35(4), p.678-692.
The Human Genome Project (HGP) was
llama_print_timings: load time = 8909.49 ms
llama_print_timings: sample time = 24.11 ms / 40 runs ( 0.60 ms per run)
llama_print_timings: prompt eval time = 16349.56 ms / 16 tokens ( 1021.85 ms per token)
llama_print_timings: eval time = 288016.53 ms / 39 runs ( 7385.04 ms per run)
llama_print_timings: total time = 305045.43 ms
Also, the perplexity is slightly better ([3]5.8269 vs [3]5.8271).
@glinscott
Yes, using BLAS during perplexity computation can be deceiving (I think I noted this somewhere earlier). I think the explanation is the following:
Let's have the matrix multiplication Z = X*Y, where:
- X is 4-bit quantized
- Y is FP32
- Z is FP32
When using BLAS, ggml will dequantize X into FP32 and use BLAS's sgemm to do the matrix multiplication.
When not using BLAS, ggml will quantize Y to 4-bit and use the SIMD routines for 4-bit dot product.
I think the BLAS computation will be more precise, because we lose precision when quantizing Y in the latter case.
Additionally, I am not super confident that the current dot product routines accumulate the floating-point values optimally - I think there might be things to improve here.
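To make the "we lose precision when quantizing Y" point concrete, here is a toy round-trip through a symmetric 4-bit quantizer (a simplified stand-in, not the actual ggml Q4_0 routine):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // A toy block of FP32 activations standing in for Y.
    const float y[8] = {0.013f, -0.207f, 0.991f, 0.042f, -0.655f, 0.300f, -0.091f, 0.500f};

    // Block scale: map the largest magnitude to quantization level 7.
    float amax = 0.0f;
    for (float v : y) amax = std::fmax(amax, std::fabs(v));
    const float d = amax / 7.0f;

    for (float v : y) {
        const int   q  = (int) std::round(v / d); // quantize to an integer level in [-7, 7]
        const float vr = q * d;                   // dequantize
        std::printf("y = % .4f  ->  q = %+d  ->  % .4f  (err % .4f)\n", v, q, vr, vr - v);
    }
    return 0;
}
```

In the BLAS path Y stays in FP32 and only X carries quantization error, which is consistent with the slightly better perplexity reported above.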
It would be nice to get this PR finalized.
Sorry for the delay. I've updated it so the batch size defaults to 512, which is much faster. Ready to go!