benchmarks?

Open brappier opened this issue 1 year ago • 40 comments

Where are the benchmarks for various hardware, e.g. Apple Silicon?

brappier avatar Mar 12 '23 05:03 brappier

M1 with 7B model: 94.24 ms per token
M1 with 13B model: 202.18 ms per token

Speeds are with the command line option -t 4. With -t 8, it runs at half the speed.

wizd avatar Mar 12 '23 05:03 wizd

Using command line option -t 8. Note this is in a VM assigned 42 of the server's 44 logical cores, with other services running on the host.

AMD EPYC 7443P, 7B: 89.39 ms per token

ElRoberto538 avatar Mar 12 '23 09:03 ElRoberto538

M1 Pro 32GB, 30B model:

main: mem per token = 43387780 bytes
main:     load time = 10701.85 ms
main:   sample time =   279.92 ms
main:  predict time = 37065.80 ms / 226.01 ms per token
main:    total time = 51992.27 ms

MLTQ avatar Mar 12 '23 15:03 MLTQ

MacBook Pro 2013, Intel i5, 2 cores, 8 GB RAM, 7B 4-bit model:

main: mem per token = 14335844 bytes
main:     load time =   8224.30 ms
main:   sample time =   1918.08 ms
main:  predict time = 308737.91 ms / 604.18 ms per token
main:    total time = 331646.62 ms

Thank you for this awesome project.

diimdeep avatar Mar 12 '23 17:03 diimdeep

Ryzen 7 3700X, 128GB RAM @ 3200, llama.cpp numbers:

$ ./main -m models/7B/ggml-model-q4_0.bin -t 8 -n 128
main: mem per token = 14434244 bytes
main:     load time =  1270.15 ms
main:   sample time =   325.76 ms
main:  predict time = 15147.15 ms / 117.42 ms per token
main:    total time = 17077.88 ms

$ ./main -m models/13B/ggml-model-q4_0.bin -t 8 -n 128
main: mem per token = 22439492 bytes
main:     load time =  2946.00 ms
main:   sample time =    86.11 ms
main:  predict time =  7358.48 ms / 216.43 ms per token
main:    total time = 11019.28 ms

$ ./main -m models/30B/ggml-model-q4_0.bin -t 8 -n 128
main: mem per token = 43387780 bytes
main:     load time =  6666.53 ms
main:   sample time =   332.71 ms
main:  predict time = 68779.27 ms / 533.17 ms per token
main:    total time = 77333.97 ms

$ ./main -m models/65B/ggml-model-q4_0.bin -t 8 -n 128
main: mem per token = 70897348 bytes
main:     load time = 14010.35 ms
main:   sample time =   335.09 ms
main:  predict time = 140527.48 ms / 1089.36 ms per token
main:    total time = 157951.48 ms

With the 30B model, an RTX 3090 manages 15 tokens/s using text-generation-webui.

neuhaus avatar Mar 12 '23 22:03 neuhaus

llama.cpp on Samsung S22 Ultra at 1.2 tokens per second

1.2 tokens/s on a Samsung S22 Ultra running 4 threads.

The S22 obviously has a more powerful processor than the Pi, but I do not think it is 12 times more powerful. It's likely you could get much faster speeds on the Pi.

I'd be willing to bet that the bottleneck is not the processor.

Reposting the 1.2 token/second Samsung S22 Ultra result here. (Originally posted in https://github.com/ggerganov/llama.cpp/issues/58)

MarkSchmidty avatar Mar 13 '23 18:03 MarkSchmidty

I must say, I was surprised this ran on my phone at all. Here are my results on a Snapdragon 8+ Gen 1 for 4-bit 7B (results attached as a screenshot):

And results for my desktop with a 13900K and 64 GB DDR5.

4bit quant 7B

main: mem per token = 14434244 bytes
main:     load time =   609.88 ms
main:   sample time =    36.60 ms
main:  predict time =  9487.02 ms / 71.33 ms per token
main:    total time = 10341.46 ms

full precision 7B

main: mem per token = 14434244 bytes
main:     load time = 26905.18 ms
main:   sample time =    37.78 ms
main:  predict time = 23033.74 ms / 173.19 ms per token
main:    total time = 50204.95 ms

4bit quant 65B

main: mem per token = 70897348 bytes
main:     load time = 83233.36 ms
main:   sample time =    36.90 ms
main:  predict time = 86000.03 ms / 646.62 ms per token
main:    total time = 172458.39 ms

Edit: Did something really stupid and ran 4-bit 13B on my phone. TL;DR: it's slow, don't (unless you have lots of RAM). My phone has 12 GB of RAM and 7 GB of manually added swap. I had to run it through an adb root shell instead of Termux, as the Android memory manager would kill Termux as soon as the model started to load. The downside to this approach is that everything else on my phone was killed, meaning I couldn't even get the screen to turn on while inference was running.

main: mem per token = 22357508 bytes
main:     load time = 29320.15 ms
main:   sample time =  2254.09 ms
main:  predict time = 5227881.50 ms / 39307.38 ms per token
main:    total time = 5335562.00 ms

ItsLogic avatar Mar 13 '23 22:03 ItsLogic

Here is my quick look at 2x Intel Xeon Gold 5120 @ 2.20GHz, compiled with -march=native.

7B

main: mem per token = 14762244 bytes
main:     load time =  3378.15 ms
main:   sample time =    15.87 ms
main:  predict time =  4494.55 ms / 115.24 ms per token
main:    total time =  8328.48 ms

7B fp16

main: mem per token = 14532644 bytes
main:     load time = 27977.19 ms
main:   sample time =    24.71 ms
main:  predict time =  9378.29 ms / 275.83 ms per token
main:    total time = 38135.22 ms

13B

main: mem per token = 22562468 bytes
main:     load time = 16860.55 ms
main:   sample time =   170.45 ms
main:  predict time = 56121.11 ms / 308.36 ms per token
main:    total time = 74377.55 ms

13B fp16

main: mem per token = 22562468 bytes
main:     load time = 64448.62 ms
main:   sample time =   129.29 ms
main:  predict time = 61505.41 ms / 455.60 ms per token
main:    total time = 127347.54 ms

30B

main: mem per token = 43547620 bytes
main:     load time = 51269.82 ms
main:   sample time =    49.77 ms
main:  predict time = 41543.11 ms / 585.11 ms per token
main:    total time = 95383.98 ms

65B

main: mem per token = 71553028 bytes
main:     load time = 99438.78 ms
main:   sample time =    44.94 ms
main:  predict time = 69203.49 ms / 1017.70 ms per token
main:    total time = 218532.06 ms

This is with 14 / 28 threads. Running with 56 threads slows it down, probably NUMA. I think 115ms is still a good result for this CPU.

So if anyone like me was wondering whether having a million cores in a server CPU makes the 65B model fast: the answer is no.

totoCZ avatar Mar 16 '23 02:03 totoCZ

So if anyone like me was wondering whether having a million cores in a server CPU makes the 65B model fast:

It's clear by now that, within a given CPU architecture, llama.cpp speed mostly tracks peak single-core performance, up to the point where all CPUs of that architecture perform roughly the same. Beyond that point, memory bandwidth and memory-bus chokepoints appear to be the major bottlenecks.

Using more cores can slow things down for two reasons:

  1. More memory bus congestion from moving bits between more places. llama.cpp is well written and easily maxes out the memory bus on most even moderately powerful systems.
  2. Reducing your effective max single core performance to that of your slowest cores. This is usually the primary culprit on 4 or 6 core devices (mostly phones) which often have 2 power cores and then 2-4 balanced and/or "efficiency" cores.

With these learnings in mind, it would be good to see benchmark results from anyone who manages to find a yet-unknown optimization in their configuration, OS environment, or hardware environment.
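
As a rough sanity check on the memory-bound picture, here is a back-of-the-envelope sketch in C. The model sizes are approximate q4_0 file sizes, and the bandwidth figure is an assumption you would replace with your own machine's effective number:

/* Rough lower bound on ms/token if every generated token must stream the
 * full set of quantized weights from RAM. Core count does not appear at
 * all once the bus is saturated. Sizes are approximate q4_0 file sizes;
 * BANDWIDTH_GBS is an assumed effective bandwidth -- substitute your own. */
#include <stdio.h>

int main(void) {
    const double BANDWIDTH_GBS = 50.0;                 /* assumption */
    const char  *names[]   = { "7B", "13B", "30B", "65B" };
    const double size_gb[] = { 4.0, 7.8, 19.5, 38.5 }; /* ~q4_0 sizes */

    for (int i = 0; i < 4; i++) {
        double floor_ms = size_gb[i] / BANDWIDTH_GBS * 1000.0;
        printf("%-4s ~%4.1f GB -> >= %6.1f ms/token at %.0f GB/s\n",
               names[i], size_gb[i], floor_ms, BANDWIDTH_GBS);
    }
    return 0;
}

If your measured ms/token is already within a small factor of that floor, adding more cores will not help much.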

MarkSchmidty avatar Mar 16 '23 09:03 MarkSchmidty

How are you getting such good performance?

I'm running an i7-10750H with 32 GB RAM, using -m ./models/7B/ggml-model-f16.bin -t 12 -n 128

main: mem per token = 14499844 bytes
main:     load time =  8892.24 ms
main:   sample time =  1988.34 ms
main:  predict time = 270018.50 ms / 2093.17 ms per token
main:    total time = 287685.50 ms

2+s per token! I get similar with the 4 bit quant, if not worse.

Edit: Running with -m ./models/7B/ggml-model-q4_0.bin -t 12 -n 128

main: mem per token = 14499844 bytes
main:     load time =  1631.32 ms
main:   sample time =  1513.06 ms
main:  predict time = 574477.00 ms / 6047.13 ms per token
main:    total time = 596436.75 ms

hanvyj avatar Mar 19 '23 16:03 hanvyj

How are you getting such good performance?

I'm running an i7-10750H with 32 GB RAM, using -m ./models/7B/ggml-model-f16.bin -t 12 -n 128

Try:

  • fewer threads. Your CPU seems to have only 6 physical cores, and llama.cpp scales poorly when you use more threads than that.
  • tell us your system info line for more context eg: system_info: n_threads = 8 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
  • make sure you compile with all optimizations

Green-Sky avatar Mar 19 '23 16:03 Green-Sky

How do you guys get these benchmarking results? When I CTRL+C out of the program, I get no output. Thanks.

xportz avatar Apr 10 '23 23:04 xportz

Just curious, why was llama.cpp invented when you can run the models on onnxruntime with CPU backend? Could someone make a comparison of relative performance at the same quantization level and also show the perplexity over a validation set?

I guess people just prefer the no-dependency route? But it seems like reinventing the wheel or reimplementing optimized code?

https://onnxruntime.ai/docs/build/inferencing.html

EDIT: I guess one significant advantage is 4-bit quantization which results in significant memory savings over 8-bit. But how does this affect perplexity?

jon-chuang avatar Apr 11 '23 14:04 jon-chuang

Just curious, why was llama.cpp invented when you can run the models on onnxruntime with CPU backend? Could someone make a comparison of relative performance at the same quantization level and also show the perplexity over a validation set?

I guess people just prefer the no-dependency route? But it seems like reinventing the wheel or reimplementing optimized code?

onnxruntime.ai/docs/build/inferencing.html

EDIT: I guess one significant advantage is 4-bit quantization which results in significant memory savings over 8-bit. But how does this affect perplexity?

The effect of 4bit on perplexity is negligible thanks to GPTQ quantization, act order, and binning. 

4bit is twice as fast as 8bit because llama.cpp is efficient enough to be memory bound, not compute bound, even on modest processors. I have not seen comparisons of ONNX CPU speeds to llama.cpp for the same quantization level, but Hugging Face Transformers is roughly 20x slower than llama.cpp. I suspect ONNX is about as efficient as HF Transformers.
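
To put a rough number on the memory-bound point, here is a quick sketch in C of how much data has to be streamed per generated token for a 7B model at different weight widths (block scales and activations are ignored, so these are only approximations):

/* Approximate GB streamed per generated token for a 7B model at different
 * weight widths. Block-scale overhead and activations are ignored, so these
 * figures are only meant to show the ratio. */
#include <stdio.h>

int main(void) {
    const double params   = 7e9;
    const double bits[]   = { 16.0, 8.0, 4.0 };
    const char  *labels[] = { "fp16", "8-bit", "4-bit" };

    for (int i = 0; i < 3; i++) {
        double gb = params * bits[i] / 8.0 / 1e9;
        printf("%-5s -> ~%4.1f GB read per token\n", labels[i], gb);
    }
    return 0;
}

When the memory bus is the limit, halving the bytes per token roughly halves the time per token, which is where the 2x between 4-bit and 8-bit comes from.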

MarkSchmidty avatar Apr 12 '23 00:04 MarkSchmidty

4bit is twice as fast as 8bit because llama.cpp is efficient enough to be memory bound, not compute bound, even on modest processors. I have not seen comparisons of ONNX CPU speeds to llama.cpp for the same quantization level, but Hugging Face Transformers is roughly 20x slower than llama.cpp. I suspect ONNX is about as efficient as HF Transformers.

How important is CPU cache size to llama.cpp's performance? Do llama's memory access patterns cause the cache to be evicted often (naive me assumes yes but I really don't know).

clulece avatar Apr 12 '23 03:04 clulece

How important is CPU cache size to llama.cpp's performance?

A: doesn't seem super important: https://github.com/ggerganov/llama.cpp/pull/778

jon-chuang avatar Apr 12 '23 03:04 jon-chuang

How do you guys get these benchmarking results? When I CTRL+C out of the program, I get no output. Thanks.

I think you can do it with the --mtest parameter.

ridwanarf25 avatar Apr 13 '23 04:04 ridwanarf25

Wish me luck, I'm running 65B with 6 cores and 32 gigs of RAM.

raghav-deepsource avatar Apr 18 '23 16:04 raghav-deepsource

@raghav-deepsource luck is what you need. You need at least ~60 GB of RAM for the 65B model. :)

Green-Sky avatar Apr 18 '23 17:04 Green-Sky

Got it chugging along at about 30 seconds per token with "recite the alphabet backwards". Interestingly, my memory usage didn't go up by much. Feels like the code may be paging the weights into memory to reduce usage or something.

raghav-deepsource avatar Apr 19 '23 05:04 raghav-deepsource

CPU: E5-2680v4 MEM: 64GB

$ ./build/bin/Release/main.exe -m ./models/65B/ggml-model-q4_0.bin -t 14 -n 128

system_info: n_threads = 14 / 28 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

llama_print_timings:        load time =  22915.12 ms
llama_print_timings:      sample time =     76.15 ms /   128 runs   (    0.59 ms per run)
llama_print_timings: prompt eval time =   4425.61 ms /     2 tokens ( 2212.81 ms per token)
llama_print_timings:        eval time = 176678.85 ms /   127 runs   ( 1391.17 ms per run)
llama_print_timings:       total time = 199672.21 ms

ai-rex avatar Apr 21 '23 11:04 ai-rex

$ ./build/bin/Release/main.exe -m ./models/llama-7B-ggml-int4/ggml-model-q4_0.bin -t 14 -n 128

system_info: n_threads = 14 / 28 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

llama_print_timings:        load time =  2677.89 ms
llama_print_timings:      sample time =    75.61 ms /   128 runs   (    0.59 ms per run)
llama_print_timings: prompt eval time =   225.42 ms /     2 tokens (  112.71 ms per token)
llama_print_timings:        eval time = 19808.81 ms /   127 runs   (  155.97 ms per run)
llama_print_timings:       total time = 22564.25 ms

ai-rex avatar Apr 21 '23 11:04 ai-rex

M1 Max, maxed GPU, 64 GB.

Note that M1 Pro vs Max matters beyond core count here since memory bandwidth doubles - 200 GB/s -> 400 GB/s

10 or so Safari tabs in the background; ~6-10% idle CPU consumption observed before the start of the test. Model: 7B q4_0 (see the command below).

Script: https://gist.github.com/kiratp/18826c1c085acf732f480e726b32686c (adapted from @KASR's script https://gist.github.com/KASR/dc3dd7f920f57013486583af7e3725f1#file-benchmark_threads_llama_cpp-py)

cmd = "./main \
     --seed 147369852 \
     --threads {threads} \
     --n_predict 128 \
     --model ./models/7B/ggml-model-q4_0.bin \
     --top_k 40 \
     --top_p 0.9 \
     --temp 0.5 \
     --repeat_last_n 64 \
     --repeat_penalty 1.1 \
     -p \"Write a funny joke:\" \
     --ignore-eos"
Running with 1 threads...
	 1 threads | run 1/3 | current token time 199.07 ms - eval time 24809.17 ms - prompt eval time 1592.53 ms
	 1 threads | run 2/3 | current token time 198.85 ms - eval time 24866.71 ms - prompt eval time 1590.83 ms
	 1 threads | run 3/3 | current token time 198.93 ms - eval time 24866.36 ms - prompt eval time 1591.47 ms
Running with 2 threads...
	 2 threads | run 1/3 | current token time 102.17 ms - eval time 12880.66 ms - prompt eval time 817.39 ms
	 2 threads | run 2/3 | current token time 102.09 ms - eval time 12880.23 ms - prompt eval time 816.71 ms
	 2 threads | run 3/3 | current token time 102.05 ms - eval time 12888.98 ms - prompt eval time 816.39 ms
Running with 3 threads...
	 3 threads | run 1/3 | current token time 71.74 ms - eval time 8931.11 ms - prompt eval time 573.96 ms
	 3 threads | run 2/3 | current token time 71.65 ms - eval time 8948.05 ms - prompt eval time 573.17 ms
	 3 threads | run 3/3 | current token time 71.31 ms - eval time 8933.5 ms - prompt eval time 570.51 ms
Running with 4 threads...
	 4 threads | run 1/3 | current token time 54.97 ms - eval time 6944.32 ms - prompt eval time 439.75 ms
	 4 threads | run 2/3 | current token time 54.81 ms - eval time 7153.19 ms - prompt eval time 438.51 ms
	 4 threads | run 3/3 | current token time 54.75 ms - eval time 7073.57 ms - prompt eval time 437.97 ms
Running with 5 threads...
	 5 threads | run 1/3 | current token time 46.04 ms - eval time 6177.01 ms - prompt eval time 368.34 ms
	 5 threads | run 2/3 | current token time 46.33 ms - eval time 6168.68 ms - prompt eval time 370.61 ms
	 5 threads | run 3/3 | current token time 47.62 ms - eval time 6172.55 ms - prompt eval time 380.94 ms
Running with 6 threads...
	 6 threads | run 1/3 | current token time 39.43 ms - eval time 5563.91 ms - prompt eval time 315.41 ms
	 6 threads | run 2/3 | current token time 39.38 ms - eval time 5543.76 ms - prompt eval time 315.03 ms
	 6 threads | run 3/3 | current token time 39.42 ms - eval time 5599.16 ms - prompt eval time 315.39 ms
Running with 7 threads...
	 7 threads | run 1/3 | current token time 34.34 ms - eval time 5676.61 ms - prompt eval time 274.74 ms
	 7 threads | run 2/3 | current token time 34.48 ms - eval time 5688.08 ms - prompt eval time 275.81 ms
	 7 threads | run 3/3 | current token time 34.19 ms - eval time 5681.7 ms - prompt eval time 273.52 ms
Running with 8 threads...
	 8 threads | run 1/3 | current token time 33.95 ms - eval time 5394.02 ms - prompt eval time 271.57 ms
	 8 threads | run 2/3 | current token time 33.29 ms - eval time 5358.99 ms - prompt eval time 266.32 ms
	 8 threads | run 3/3 | current token time 32.22 ms - eval time 5311.68 ms - prompt eval time 257.74 ms
Running with 9 threads...
	 9 threads | run 1/3 | current token time 87.65 ms - eval time 15074.75 ms - prompt eval time 701.22 ms
	 9 threads | run 2/3 | current token time 88.11 ms - eval time 13013.74 ms - prompt eval time 704.86 ms
	 9 threads | run 3/3 | current token time 85.37 ms - eval time 12599.68 ms - prompt eval time 682.97 ms
Running with 10 threads...
	 10 threads | run 1/3 | current token time 114.17 ms - eval time 17767.65 ms - prompt eval time 913.38 ms
	 10 threads | run 2/3 | current token time 107.66 ms - eval time 17790.2 ms - prompt eval time 861.27 ms
	 10 threads | run 3/3 | current token time 103.85 ms - eval time 16773.97 ms - prompt eval time 830.81 ms

[Chart: Llama scaling]

kiratp avatar May 01 '23 00:05 kiratp

Threadripper 3990x with 256 GB

You can see where memory bandwidth/contention becomes the bottleneck.

Running with 32 threads...
         32 threads | run 1/3 | current token time 21.28 ms - eval time 9901.05 ms - prompt eval time 170.26 ms
         32 threads | run 2/3 | current token time 21.95 ms - eval time 10361.13 ms - prompt eval time 175.6 ms
         32 threads | run 3/3 | current token time 21.57 ms - eval time 9927.76 ms - prompt eval time 172.56 ms
Running with 40 threads...
         40 threads | run 1/3 | current token time 20.67 ms - eval time 10545.29 ms - prompt eval time 165.33 ms
         40 threads | run 2/3 | current token time 20.07 ms - eval time 10493.8 ms - prompt eval time 160.58 ms
         40 threads | run 3/3 | current token time 20.25 ms - eval time 10652.63 ms - prompt eval time 162.03 ms
Running with 48 threads...
         48 threads | run 1/3 | current token time 19.58 ms - eval time 10747.09 ms - prompt eval time 156.62 ms
         48 threads | run 2/3 | current token time 19.51 ms - eval time 10547.48 ms - prompt eval time 156.1 ms
         48 threads | run 3/3 | current token time 20.05 ms - eval time 11197.02 ms - prompt eval time 160.44 ms
Running with 56 threads...
         56 threads | run 1/3 | current token time 20.24 ms - eval time 11720.33 ms - prompt eval time 161.93 ms
         56 threads | run 2/3 | current token time 19.58 ms - eval time 11301.06 ms - prompt eval time 156.68 ms
         56 threads | run 3/3 | current token time 19.94 ms - eval time 11340.81 ms - prompt eval time 159.49 ms
Running with 64 threads...
         64 threads | run 1/3 | current token time 20.72 ms - eval time 12184.85 ms - prompt eval time 165.77 ms
         64 threads | run 2/3 | current token time 20.45 ms - eval time 11545.2 ms - prompt eval time 163.62 ms
         64 threads | run 3/3 | current token time 20.14 ms - eval time 12126.9 ms - prompt eval time 161.15 ms
Running with 72 threads...
         72 threads | run 1/3 | current token time 30.12 ms - eval time 15985.67 ms - prompt eval time 240.92 ms
         72 threads | run 2/3 | current token time 29.78 ms - eval time 15781.94 ms - prompt eval time 238.2 ms
         72 threads | run 3/3 | current token time 30.32 ms - eval time 15877.56 ms - prompt eval time 242.59 ms

kiratp avatar May 01 '23 17:05 kiratp

M1 Max, maxed GPU, 64 GB.

Note that M1 Pro vs Max matters beyond core count here since memory bandwidth doubles - 200 GB/s -> 400 GB/s

Benchmarks show that the M1 Max CPU can't use more than 243 GB/s of memory bandwidth.[^1] For an entirely CPU-based algorithm like this one, there is little benefit over the Pro's 200 GB/s of memory bandwidth.

It would be a different story if/when the implementation is extended so that the GPU is involved.

[^1]: Huge memory bandwidth, but not for every block

jackpal avatar May 05 '23 13:05 jackpal

The relevant bit

Adding a third thread there’s a bit of an imbalance across the clusters, DRAM bandwidth goes to 204GB/s, but a fourth thread lands us at 224GB/s and this appears to be the limit on the SoC fabric that the CPUs are able to achieve, as adding additional cores and threads beyond this point does not increase the bandwidth to DRAM at all. It’s only when the E-cores, which are in their own cluster, are added in, when the bandwidth is able to jump up again, to a maximum of 243GB/s.

This seems to indicate that it is possible to exceed 200 GB/s (by 12%, at 224 GB/s, ignoring E-cores since they slow things down overall), so an M1 Max should outperform an M1 Pro at 8 threads even if it's not using all of the 400 GB/s the memory bus supports. However, I don't have an M1 Pro to test, so YMMV.

kiratp avatar May 05 '23 17:05 kiratp

From a single core perspective, meaning from a single software thread, things are quite impressive for the chip, as it’s able to stress the memory fabric to up to 102GB/s.

Can anyone demonstrate a basic C program that achieves this speed? My M1 Pro only goes up to 40 GB/s on a single thread.

ggerganov avatar May 05 '23 18:05 ggerganov

Been a while since I've written any C, but here is a Rust program: https://github.com/kiratp/memory-bandwidth

Result with a bunch of stuff running on an M1 Max, 64 GB. I would assume the number would go higher after a clean restart with no other apps running.

    Finished release [optimized] target(s) in 0.00s
     Running `target/release/memorybw`
Elapsed time: 8.57s, Bandwidth: 89018.66 MB/s

Idle Threadripper 3990X, 256 GB at 3200 MT/s, with PBO level 3 enabled, 64 threads:

   Compiling memorybw v0.1.0 (/home/kiratpandya/github/memory-bandwidth)
    Finished release [optimized] target(s) in 0.31s
     Running `target/release/memorybw`
Elapsed time: 87.78s, Bandwidth: 69534.59 MB/s

kiratp avatar May 08 '23 17:05 kiratp

Alright, so I got GPT-4 to write me a C equivalent. I am not sure about its quality, but cursory analysis suggests it is correct, though I think there is a bunch of overhead in the call to pthread_create. Same repo: https://github.com/kiratp/memory-bandwidth
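
The core of the approach is just a timed streaming read over a large buffer. A minimal single-threaded sketch in C looks roughly like this (this is an illustration only, not the exact program in the repo; the buffer size and iteration count are arbitrary, and a plain scalar loop compiled with -O3 may still fall short of the true peak):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const size_t n = (size_t)1 << 28;          /* 256M doubles = 2 GiB */
    double *buf = malloc(n * sizeof(double));
    if (!buf) { perror("malloc"); return 1; }
    for (size_t i = 0; i < n; i++) buf[i] = (double)i;   /* touch every page */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    double acc = 0.0;
    const int iters = 4;
    for (int it = 0; it < iters; it++)
        for (size_t i = 0; i < n; i++)
            acc += buf[i];                     /* sequential streaming read */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double gb   = (double)iters * n * sizeof(double) / 1e9;
    printf("acc = %f (printed so the loop is not optimized away)\n", acc);
    printf("%.2f GB in %.2f s -> %.2f GB/s\n", gb, secs, gb / secs);
    free(buf);
    return 0;
}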

Same M1:

Thread 6 completed. Accumulated value: 499971380.906237
Thread 0 completed. Accumulated value: 500034749.012463
Thread 1 completed. Accumulated value: 499991005.713216
Thread 3 completed. Accumulated value: 500045810.595650
Thread 5 completed. Accumulated value: 500009162.631447
Thread 2 completed. Accumulated value: 500017471.449399
Elapsed time: 1.029004 seconds
Memory bandwidth: 62.196065 GB/s

Same Threadripper:

<Snipping 64 thread readouts>
Thread 20 completed. Accumulated value: 499997710.298199
Elapsed time: 6.650833 seconds
Memory bandwidth: 76.982839 GB/s

kiratp avatar May 08 '23 18:05 kiratp

7B-Q4:
llama_print_timings: prompt eval time = 1784.07 ms /  9 tokens ( 198.23 ms per token)

13B-Q4:
llama_print_timings: prompt eval time = 5200.94 ms / 13 tokens ( 400.07 ms per token)

i3-9100, DDR4-2400 8 GB x 2 (38400 MB/s). Both runs were at 100% CPU, with only AVX2.

rankaiyx avatar May 11 '23 03:05 rankaiyx