
The MLX Challenge

Open · ggerganov opened this issue · 0 comments

ref https://twitter.com/awnihannun/status/1777072588633882741

This branch starts from the flash-attention branch (#5021, #6508).
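
If you are reproducing this from scratch, a minimal setup could look like the sketch below (assumptions: a macOS machine where the Makefile enables the Metal backend by default, and that you have already checked out the branch from the referenced PRs):

# illustrative setup; the exact branch comes from #5021 / #6508
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j quantize llama-bench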

To perform a benchmark for the challenge, run:

# generate a pure 4-bit model
./quantize --pure models/mistral-7b/ggml-model-f16.gguf models/mistral-7b/ggml-model-q4_0-pure.gguf q4_0
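
(Note: --pure disables the default mixed-precision layout, in which a few tensors such as the output weights are kept at a higher-precision type, and instead quantizes every tensor to Q4_0. This yields a uniformly 4-bit model, which should be closer to what MLX's 4-bit quantization produces and keeps the comparison fair.)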

make -j llama-bench
./llama-bench -m ./models/mistral-7b/ggml-model-q4_0-pure.gguf -p 0 -t 4 -n 128 -r 10 -fa 1
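
For reference, the llama-bench flags used here: -p 0 skips the prompt-processing test, -n 128 measures generation of 128 tokens, -t 4 sets the thread count, -r 10 averages over 10 repetitions, and -fa 1 enables flash attention.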

Current numbers on M2 Ultra:

| model         |     size |  params | backend | ngl | threads |   test |           t/s |
| ------------- | -------: | ------: | ------- | --: | ------: | -----: | ------------: |
| llama 7B Q4_0 | 3.79 GiB |  7.24 B | Metal   |  99 |       4 | tg 128 | 102.29 ± 0.07 |
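
In other words, text generation (tg 128) reaches roughly 102 tokens per second with the pure Q4_0 model and flash attention enabled.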

build: 22df85ff (2707)

ggerganov · Apr 08 '24