
faster performance on older machines

Open sibeliu opened this issue 1 year ago • 15 comments

On machines with smaller memory and slower processors, it can be useful to reduce the overall number of threads running. For instance, on my MacBook Pro (Intel i5, 16 GB), 4 threads are much faster than 8. Try:

make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "Author-contribution statements and acknowledgements in research papers should state clearly and specifically whether, and to what extent, the authors used AI technologies such as ChatGPT " -t 4 -n 512
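
A quick way to find the sweet spot on a given machine is to time the same short prompt at several thread counts and compare the per-token timings that main prints at the end of each run (a minimal sketch; the thread counts, prompt, and grep pattern are only illustrative, adjust them for your setup):

for t in 2 4 6 8; do
  echo "== threads: $t =="
  ./main -m ./models/7B/ggml-model-q4_0.bin -p "Hello" -n 64 -t "$t" 2>&1 | grep "per token"
done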

sibeliu avatar Mar 11 '23 17:03 sibeliu

@sibeliu what does getconf _NPROCESSORS_ONLN say on your machine?

prusnak avatar Mar 11 '23 18:03 prusnak

Not sure why, but on my Mac M1 Pro / 16GB using 4 threads works far better than 8 threads:

(base) musixmatch@Loretos-MBP llama.cpp % ./main -m ./models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 4 -n 512
main: seed = 1678565187
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size =   512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291

main: prompt: 'Building a website can be done in 10 simple steps:'
main: number of tokens in prompt = 15
     1 -> ''
  8893 -> 'Build'
   292 -> 'ing'
   263 -> ' a'
  4700 -> ' website'
   508 -> ' can'
   367 -> ' be'
  2309 -> ' done'
   297 -> ' in'
 29871 -> ' '
 29896 -> '1'
 29900 -> '0'
  2560 -> ' simple'
  6576 -> ' steps'
 29901 -> ':'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000


Building a website can be done in 10 simple steps:
1. Get a server
2. Set up a database
3. Get a name server
4. Get a web server
5. Get a control panel
6. Set up a phpMyAdmin interface for MySQL
7. Set up a database user
8. Get a root password for your server
9. Upload your files
10. Test your files
After that you need to make sure that your website is indexed in Google. You can do that with Google Webmaster Tools.
I have only listed the most important and core steps to create a website. The rest of the process depends on your level of experience and on what you want to accomplish with your website.
In this article I will focus on the first 10 steps. I hope that this article will give you a good understanding of how to create a website from scratch.
My goal is to create a website that I can use to make a living online. My vision is to make a website that can generate more than $100 per day, that’s $3,000 per month.
I don’t think that it’s possible to have a website that will generate money while you sleep. There will always be work involved, so I don’t want to start off with a website that I need to work on all day long.
I want to have a website that can generate at least $100 per day, that’s $3,000 per month. It’s not going to be easy, but I know that it’s possible.
Some people create websites to make money, others do it to learn new skills. Creating a website is a good exercise in itself, but the real benefit is to get more traffic to your website.
The more traffic you get, the bigger your website will be. The bigger your website is, the more income it will generate.
In this article I will guide you through the process of creating a website step by step.
A website doesn’t have to be something fancy, but it needs to be professional looking. You don’t need to go for the best looking website out there, but make sure that your website has a good overall impression.
If you want to create a website to make money online, then make sure that your website is easy to navigate. A website that is easy to navigate will be a lot easier to

main: mem per token = 14368644 bytes
main:     load time =  1301.80 ms
main:   sample time =  1098.24 ms
main:  predict time = 59317.72 ms / 116.08 ms per token
main:    total time = 62136.24 ms

loretoparisi avatar Mar 11 '23 20:03 loretoparisi

Here's my benchmark on Apple M1 16 GB:

threads      1        2        4        8
7B  4-bit    316.52   164.76    98.56   277.92
13B 4-bit    628.48   376.14   224.21   538.87

(ms per token)

I think it's good that the default value for -t is 4.

prusnak avatar Mar 11 '23 20:03 prusnak

@prusnak that's because there are 4 performance cores on the M1

wizzard0 avatar Mar 11 '23 21:03 wizzard0

@prusnak that's because there are 4 performance cores on the M1

but on my M1 Pro I have 8 cores...

loretoparisi avatar Mar 11 '23 21:03 loretoparisi

8, which would be nice to use. With the current setup I'm only using 4

sibeliu avatar Mar 11 '23 23:03 sibeliu

8, which would be nice to use. With the current setup I'm only using 4

It should, but it seems there is some bottleneck on the M1 Pro that prevents better performance with 8 threads, resulting in slower inference than when specifying 4 threads; not sure why. I will do a better test.

loretoparisi avatar Mar 12 '23 14:03 loretoparisi

I'm getting the same results on a 4c/8t i7 Skylake on Linux (7B model, 4-bit). -t 4 is several times faster than -t 8.

plhosk avatar Mar 12 '23 17:03 plhosk

I guess this is because hyperthreading does not help with running the model? So the number of virtual cores is not important, only the number of physical cores?
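
One way to check is to compare logical and physical core counts (a minimal sketch; getconf counts logical CPUs, while the other commands report the physical breakdown on Linux and macOS respectively):

getconf _NPROCESSORS_ONLN                                    # logical CPUs (hyperthreads included)
lscpu | grep -E 'Core\(s\) per socket|Thread\(s\) per core'  # Linux: physical cores and SMT
sysctl -n hw.physicalcpu hw.logicalcpu                       # macOS: physical vs logical cores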

prusnak avatar Mar 12 '23 17:03 prusnak

Upon further testing, it seems that if anything else is using the CPU (e.g. Firefox open playing a video), -t 8 slows to a crawl while -t 4 is relatively unaffected, but after closing all CPU-consuming programs -t 8 becomes faster than -t 4.

plhosk avatar Mar 12 '23 17:03 plhosk

I guess this is because hyperthreading does not help with running the model? So the number of virtual cores is not important, only the number of physical cores?

That looks like the cause. Even though getconf _NPROCESSORS_ONLN says 8, there are only 4 physical cores on my processor. But it is still odd that both -t 4 and 8 utilize only 50% of my available processor. If I launch other apps it goes over 50%.

BTW can you think of any way to make the GPU help out? It isn't doing anything at the moment

sibeliu avatar Mar 12 '23 17:03 sibeliu

BTW can you think of any way to make the GPU help out? It isn't doing anything at the moment

This project is CPU only, however there's a different one that runs on the GPU. Keep in mind the weights are not compatible between the two projects.

https://github.com/oobabooga/text-generation-webui

plhosk avatar Mar 12 '23 17:03 plhosk

This project is CPU only, however there's a different one that runs on the GPU. Keep in mind the weights are not compatible between the two projects.

https://github.com/oobabooga/text-generation-webui

Thank you! I'll take a look. I just have the GPU in my MacBook, wish I had an A100 or something...

sibeliu avatar Mar 12 '23 18:03 sibeliu

Interestingly, -t 4 works much faster than -t 8 on my 4-core/8-thread i7-8550U too.

l29ah avatar Mar 13 '23 11:03 l29ah

I suspect this is because the inference is memory-I/O bottlenecked and not CPU bottlenecked. On my 16-core (32-hyperthread) system, with -t 16:

main: mem per token = 43600900 bytes
main:     load time = 36559.40 ms
main:   sample time =   196.16 ms
main:  predict time = 80744.02 ms / 684.27 ms per token
main:    total time = 125217.42 ms

and -t 8:

main: mem per token = 43387780 bytes
main:     load time = 30327.24 ms
main:   sample time =   185.50 ms
main:  predict time = 116525.71 ms / 987.51 ms per token
main:    total time = 150837.81 ms

So I'm not getting linear scaling by doubling the number of cores. Instructions per clock (IPC) looks to be around 0.6, which also confirms this.
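
A rough back-of-the-envelope check (a sketch, assuming the whole ~4 GB q4_0 7B model has to be streamed from RAM for every generated token): dividing the model size by the per-token latency gives the sustained memory bandwidth the run needs, which quickly approaches what a typical memory subsystem can deliver, so adding threads stops helping.

# e.g. the M1 numbers earlier in the thread: ~4.0 GB model at 116 ms per token
echo "scale=1; 4.0 / 0.116" | bc    # ~34 GB/s of sustained read bandwidth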

gjmulder avatar Mar 13 '23 12:03 gjmulder

On machines with smaller memory and slower processors, it can be useful to reduce the overall number of threads running. For instance, on my MacBook Pro (Intel i5, 16 GB), 4 threads are much faster than 8. Try:

make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "Author-contribution statements and acknowledgements in research papers should state clearly and specifically whether, and to what extent, the authors used AI technologies such as ChatGPT " -t 4 -n 512

Hi @sibeliu, I cannot load the model on my Intel i7 machine. I get:

main: build = 635 (5c64a09)
main: seed  = 1686146865
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
error loading model: unexpectedly reached end of file
llama_init_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './models/7B/ggml-model-q4_0.bin'
main: error: unable to load model

Any ideas?

jyviko avatar Jun 07 '23 14:06 jyviko

Hi Iraklis, how much memory do you have? Have you tried one of the smaller quantized versions that have been released recently? Also, what exact shell command are you using to run it?


sibeliu avatar Jun 07 '23 15:06 sibeliu

I have 16 GB of memory. Here is my command:

./main -m ./models/7B/ggml-model-q4_0.bin -p "[PROMPT]" -t 4 -n 512

jyviko avatar Jun 07 '23 20:06 jyviko

It looks to me like a corrupt binary. Maybe try re-downloading the model and starting from scratch? Sorry I can’t be more helpful.
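
Before re-converting, it may be worth confirming the file really is truncated (a minimal sketch; the expected size and checksum depend on where the weights came from):

ls -lh ./models/7B/ggml-model-q4_0.bin         # a 7B q4_0 file should be roughly 4 GB
shasum -a 256 ./models/7B/ggml-model-q4_0.bin  # compare against the checksum published by the source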


sibeliu avatar Jun 08 '23 05:06 sibeliu

Hi @sibeliu, I cannot load the model on my Intel i7 machine. I get:

main: build = 635 (5c64a09)
main: seed  = 1686146865
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
error loading model: unexpectedly reached end of file
llama_init_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './models/7B/ggml-model-q4_0.bin'
main: error: unable to load model

Any ideas?

@jyviko: Where did you download ggml-model-q4_0.bin?

If you downloaded it from somewhere, try converting it from consolidated.00.pth following this guide:

  • Download the original weights (consolidated.00.pth) from https://github.com/facebookresearch/llama/pull/73/files
  • Convert and quantize:

python convert-pth-to-ggml.py models/7B/ 1
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2

then run it to see if it works:

$ ./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128 -p 'The meaning of life is'
main: build = 917 (1a94186)
main: seed  = 1690512360
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 3949.96 MB (+  256.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0


 The meaning of life is to know the value and purpose of everything.
This entry was posted in Uncategorized on March 14, 2017 by rele3915. [end of text]

llama_print_timings:        load time =  2429.46 ms
llama_print_timings:      sample time =    27.65 ms /    39 runs   (    0.71 ms per token,  1410.34 tokens per second)
llama_print_timings: prompt eval time =   199.72 ms /     6 tokens (   33.29 ms per token,    30.04 tokens per second)
llama_print_timings:        eval time =  1630.04 ms /    38 runs   (   42.90 ms per token,    23.31 tokens per second)
llama_print_timings:       total time =  1860.81 ms

quantonganh avatar Jul 28 '23 02:07 quantonganh