Gary Linscott
Ok, very interesting result. From https://github.com/ggerganov/llama.cpp/discussions/406#discussioncomment-5397084, there was a delta between 10 and 32 threads. So I tried rerunning my experiment with 1 thread. It's perfectly consistent with the different...
@ggerganov awesome! Thank you, very nice find. @Green-Sky ah, I must admit, I don't quite understand the `main` mode batch size parameter. I thought it does evaluation after `batch` tokens?...
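For my own notes, here is how I currently picture the batch parameter: the prompt is fed through the model in slices of up to `batch` tokens per eval call, rather than all at once. The sketch below is purely illustrative; `eval_slice` and the variable names are made up and are not the actual llama.cpp API.

```c
#include <stdio.h>

/* Stand-in for a model eval call over `n` tokens starting at offset n_past.
 * NOT the real llama.cpp API, just a placeholder for the idea. */
static void eval_slice(const int *tokens, int n, int n_past) {
    (void)tokens;
    printf("eval %d tokens at position %d\n", n, n_past);
}

int main(void) {
    int tokens[100] = {0};      /* pretend this is the tokenized prompt */
    const int n_tokens = 100;
    const int n_batch  = 8;     /* e.g. -b 8 */

    /* Feed the prompt in slices of up to n_batch tokens. */
    for (int i = 0; i < n_tokens; i += n_batch) {
        int n_eval = n_tokens - i;
        if (n_eval > n_batch) n_eval = n_batch;
        eval_slice(tokens + i, n_eval, i);
    }
    return 0;
}
```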
@ggerganov `./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw` runs out of buffer space with 4870e455b3653f7d7769fa5772b2c90ffad088df. If I go back to 483bab2e3d4a868fe679d8bb32827d2a4df214dc it works well.
Ok, so after 483bab2e3d4a868fe679d8bb32827d2a4df214dc, I see results are consistent at a given batch_size, but different across batch_sizes. E.g. I tested `$ ./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -b 8` with 1...
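For context, what `--perplexity` reports should just be the standard token-level perplexity, i.e. the exponential of the mean negative log-likelihood over the predicted tokens:

$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\left(x_i \mid x_{<i}\right)\right)$$

In exact arithmetic this would be independent of `-b`, since the batch size only changes how tokens are fed through the model, so any spread across batch sizes has to come from the numerics of the eval path.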
@ggerganov thanks for the suggestion - I'm on an AMD 5950X. I did try building with `LLAMA_NO_ACCELERATE=1`, but got the same results. It is interesting they switch at batch size...
I'm doing a run to compare batch size 8 vs 512 for the default context size with BLAS on, and if that looks close, this is ready to go. Otherwise,...
Some very interesting results here. I'm building with `LLAMA_OPENBLAS=1 make -j4` currently. I *think* this means that the batch_size 8 version is not using BLAS, while the larger one is....
Indeed, hardcoding `ggml_compute_forward_mul_mat_use_blas` to return true results in excellent perplexity, but it's incredibly slow:
```
perplexity : calculating perplexity over 655 chunks, batch_size=16
847.81 seconds per pass - ETA 154.26...
```
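For anyone following along, the dispatch decision lives in `ggml_compute_forward_mul_mat_use_blas` in ggml.c. Below is a simplified sketch from memory (the real function takes `ggml_tensor` pointers, and the thresholds and contiguity checks may differ in current ggml.c): the dimension corresponding to the number of tokens in the batch has to be at least 32, which would explain why `-b 8` never hits BLAS while larger batches do.

```c
#include <stdbool.h>
#include <stdio.h>

/* Simplified sketch of ggml's BLAS dispatch heuristic, from memory --
 * not the actual function signature. */
static bool mul_mat_use_blas_sketch(bool contiguous, int ne0, int ne1, int ne10) {
    /* only hand the multiply off to BLAS when all dimensions are large
     * enough for the library call to pay off */
    return contiguous && ne0 >= 32 && ne1 >= 32 && ne10 >= 32;
}

int main(void) {
    /* ne1 tracks the number of tokens evaluated in one go, i.e. the -b value;
     * 4096 stands in for the 7B hidden dimension */
    printf("batch   8 -> BLAS: %d\n", mul_mat_use_blas_sketch(true, 4096,   8, 4096));
    printf("batch 512 -> BLAS: %d\n", mul_mat_use_blas_sketch(true, 4096, 512, 4096));
    return 0;
}
```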
Sorry for the delay, I've updated it so the batch size defaults to 512, which is much faster. Ready to go!
Wow, this is a game changer! Interestingly, 16 threads and 8 threads seem to be the same speed now. Only uses ~600MB of GPU RAM (RTX 3080), and GPU utilization 65%...