# The MLX Challenge

Ref: https://twitter.com/awnihannun/status/1777072588633882741
This branch starts from the flash-attention branch (#5021, #6508).
To perform a benchmark for the challenge, run:
```sh
# generate a pure 4-bit model (all tensors quantized to Q4_0)
./quantize --pure models/mistral-7b/ggml-model-f16.gguf models/mistral-7b/ggml-model-q4_0-pure.gguf q4_0

# build and run the benchmark tool
make -j llama-bench
./llama-bench -m ./models/mistral-7b/ggml-model-q4_0-pure.gguf -p 0 -t 4 -n 128 -r 10 -fa 1
```
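The quantize step above assumes that the f16 GGUF already exists. If it does not, it can be produced from the Mistral-7B weights first; the sketch below is one possible way, assuming the repository's `convert.py` script and a local `models/mistral-7b/` directory (the exact script name and flags may differ depending on the checkout):

```sh
# Assumed prerequisite (not part of the original instructions):
# convert the Mistral-7B weights to the f16 GGUF that the quantize step expects.
python3 convert.py models/mistral-7b/ \
    --outtype f16 \
    --outfile models/mistral-7b/ggml-model-f16.gguf
```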
Current numbers on M2 Ultra:
| model | size | params | backend | ngl | threads | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.79 GiB | 7.24 B | Metal | 99 | 4 | tg 128 | 102.29 ± 0.07 |
build: 22df85ff (2707)
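For a quick sensitivity check, llama-bench generally accepts comma-separated values for its flags, so several configurations can be measured in one invocation and reported in the same table format. A minimal sketch (treating the comma-list behavior of these particular flags as an assumption about this build):

```sh
# Compare flash attention off/on and two thread counts in a single run
./llama-bench -m ./models/mistral-7b/ggml-model-q4_0-pure.gguf -p 0 -n 128 -r 10 -t 4,8 -fa 0,1
```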