Gary Linscott
Ok, very interesting result. From https://github.com/ggerganov/llama.cpp/discussions/406#discussioncomment-5397084, there was a delta between 10 and 32 threads. So I tried rerunning my experiment with 1 thread. It's perfectly consistent with the different...
@ggerganov awesome! Thank you, very nice find. @Green-Sky ah, I must admit, I don't quite understand the `main` mode batch size parameter. I thought it does evaluation after `batch` tokens?...
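For my own notes, here is how I currently picture the batch parameter: the prompt is fed through the model in slices of up to `batch` tokens per eval call, rather than all at once. The sketch below is purely illustrative; `eval_slice` and the variable names are made up and are not the actual llama.cpp API.

```c
#include <stdio.h>

/* Stand-in for a model eval call over `n` tokens starting at offset n_past.
 * NOT the real llama.cpp API, just a placeholder for the idea. */
static void eval_slice(const int *tokens, int n, int n_past) {
    (void)tokens;
    printf("eval %d tokens at position %d\n", n, n_past);
}

int main(void) {
    int tokens[100] = {0};      /* pretend this is the tokenized prompt */
    const int n_tokens = 100;
    const int n_batch  = 8;     /* e.g. -b 8 */

    /* Feed the prompt in slices of up to n_batch tokens. */
    for (int i = 0; i < n_tokens; i += n_batch) {
        int n_eval = n_tokens - i;
        if (n_eval > n_batch) n_eval = n_batch;
        eval_slice(tokens + i, n_eval, i);
    }
    return 0;
}
```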
@ggerganov `./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw` runs out of buffer space with 4870e455b3653f7d7769fa5772b2c90ffad088df. If I go back to 483bab2e3d4a868fe679d8bb32827d2a4df214dc it works well.
Ok, so after 483bab2e3d4a868fe679d8bb32827d2a4df214dc, I see results are consistent at a given batch_size, but different across batch_sizes. E.g. I tested `$ ./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -b 8` with 1...
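For context, what `--perplexity` reports should just be the standard token-level perplexity, i.e. the exponential of the mean negative log-likelihood over the predicted tokens:

$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\left(x_i \mid x_{<i}\right)\right)$$

In exact arithmetic this would be independent of `-b`, since the batch size only changes how tokens are fed through the model, so any spread across batch sizes has to come from the numerics of the eval path.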
@ggerganov thanks for the suggestion - I'm on an AMD 5950X. I did try building with `LLAMA_NO_ACCELERATE=1`, but got the same results. It is interesting they switch at batch size...
I'm doing a run to compare batch size 8 vs 512 for the default context size with BLAS on, and if that looks close, this is ready to go. Otherwise,...
Some very interesting results here. I'm building with `LLAMA_OPENBLAS=1 make -j4` currently. I *think* this means that the batch_size 8 version is not using BLAS, while the larger one is....
Indeed, hardcoding `ggml_compute_forward_mul_mat_use_blas` to return true results in excellent perplexity, but it's incredibly slow:
```
perplexity : calculating perplexity over 655 chunks, batch_size=16
847.81 seconds per pass - ETA 154.26...
```
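For anyone following along, the dispatch decision lives in `ggml_compute_forward_mul_mat_use_blas` in ggml.c. Below is a simplified sketch from memory (the real function takes `ggml_tensor` pointers, and the thresholds and contiguity checks may differ in current ggml.c): the dimension corresponding to the number of tokens in the batch has to be at least 32, which would explain why `-b 8` never hits BLAS while larger batches do.

```c
#include <stdbool.h>
#include <stdio.h>

/* Simplified sketch of ggml's BLAS dispatch heuristic, from memory --
 * not the actual function signature. */
static bool mul_mat_use_blas_sketch(bool contiguous, int ne0, int ne1, int ne10) {
    /* only hand the multiply off to BLAS when all dimensions are large
     * enough for the library call to pay off */
    return contiguous && ne0 >= 32 && ne1 >= 32 && ne10 >= 32;
}

int main(void) {
    /* ne1 tracks the number of tokens evaluated in one go, i.e. the -b value;
     * 4096 stands in for the 7B hidden dimension */
    printf("batch   8 -> BLAS: %d\n", mul_mat_use_blas_sketch(true, 4096,   8, 4096));
    printf("batch 512 -> BLAS: %d\n", mul_mat_use_blas_sketch(true, 4096, 512, 4096));
    return 0;
}
```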
Sorry for the delay, I've updated it so the batch size defaults to 512, which is much faster. Ready to go!
Wow, this is a game changer! Interestingly, 16 threads and 8 threads seem to be the same speed now. Only uses ~600MB of GPU RAM (RTX 3080), and GPU utilization 65%...