# The MLX Challenge

Ref: https://twitter.com/awnihannun/status/1777072588633882741
This branch starts from the flash-attention branch (#5021, #6508).
To perform a benchmark for the challenge, run:
```sh
# generate a pure 4-bit model (all tensors quantized to Q4_0)
./quantize --pure models/mistral-7b/ggml-model-f16.gguf models/mistral-7b/ggml-model-q4_0-pure.gguf q4_0

# build and run the benchmark tool
make -j llama-bench
./llama-bench -m ./models/mistral-7b/ggml-model-q4_0-pure.gguf -p 0 -t 4 -n 128 -r 10 -fa 1
```
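The quantize step above assumes that the f16 GGUF already exists. If it does not, it can be produced from the Mistral-7B weights first; the sketch below is one possible way, assuming the repository's `convert.py` script and a local `models/mistral-7b/` directory (the exact script name and flags may differ depending on the checkout):

```sh
# Assumed prerequisite (not part of the original instructions):
# convert the Mistral-7B weights to the f16 GGUF that the quantize step expects.
python3 convert.py models/mistral-7b/ \
    --outtype f16 \
    --outfile models/mistral-7b/ggml-model-f16.gguf
```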
Current numbers on M2 Ultra:
| model | size | params | backend | ngl | threads | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.79 GiB | 7.24 B | Metal | 99 | 4 | tg 128 | 102.29 ± 0.07 |
build: 22df85ff (2707)
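For a quick sensitivity check, llama-bench generally accepts comma-separated values for its flags, so several configurations can be measured in one invocation and reported in the same table format. A minimal sketch (treating the comma-list behavior of these particular flags as an assumption about this build):

```sh
# Compare flash attention off/on and two thread counts in a single run
./llama-bench -m ./models/mistral-7b/ggml-model-q4_0-pure.gguf -p 0 -n 128 -r 10 -t 4,8 -fa 0,1
```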