
How's the inference speed and memory usage?

Open lucasjinreal opened this issue 1 year ago • 13 comments

How's the inference speed and memory usage?

lucasjinreal avatar Mar 12 '23 06:03 lucasjinreal

Did some testing on my machine (AMD 5700G with 32GB RAM on Arch Linux) and was able to run most of the models. With the 65B model I would need 40+ GB of RAM, and using swap to compensate was just too slow. (Prompt was "They" on seed 1678609319.)

| Quantized model | Threads | Memory use | Time per token |
|---|---|---|---|
| llama-7b | 4 | 4.2 GB | 137.10 ms |
| llama-7b | 6 | 4.2 GB | 100.43 ms |
| llama-7b | 8 | 4.2 GB | 112.44 ms |
| llama-7b | 10 | 4.2 GB | 131.63 ms |
| llama-7b | 12 | 4.2 GB | 132.73 ms |
| llama-13b | 4 | 7.9 GB | 261.88 ms |
| llama-13b | 6 | 7.9 GB | 190.74 ms |
| llama-13b | 8 | 7.9 GB | 209.15 ms |
| llama-13b | 10 | 7.9 GB | 244.64 ms |
| llama-13b | 12 | 7.9 GB | 257.72 ms |
| llama-30b | 4 | 19 GB | 645.15 ms |
| llama-30b | 6 | 19 GB | 463.04 ms |
| llama-30b | 8 | 19 GB | 476.64 ms |
| llama-30b | 10 | 19 GB | 583.19 ms |
| llama-30b | 12 | 19 GB | 593.75 ms |
My PC has 8 cores, so it seems that, as with whisper.cpp, keeping threads at 6-7 gives the best results.
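For context, a run like the ones in the table above would be invoked roughly like this (a minimal sketch: the model path, file name, and `-n 128` token count are assumptions; only the prompt, seed, and thread counts come from the comment):

```bash
# Hypothetical benchmark invocation for the 7B rows above:
#   -m  path to the quantized model (assumed q4_0 file name)
#   -t  thread count being swept (4/6/8/10/12 in the table)
#   -n  number of tokens to generate (assumed)
#   -s  seed and -p prompt quoted in the comment
./main -m ./models/7B/ggml-model-q4_0.bin -t 6 -n 128 -s 1678609319 -p "They"
```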

factfictional avatar Mar 12 '23 09:03 factfictional

I'll run the same tests on an EPYC 7443P to compare; it should be able to run 65B. Copying it to my SSD now.

ElRoberto538 avatar Mar 12 '23 09:03 ElRoberto538

@ggerganov It looks very nice: it runs on CPU but still gives reasonable speed. I can even run 13B on my PC. What do you think the inference speed would be with a longer prompt?

lucasjinreal avatar Mar 12 '23 09:03 lucasjinreal

A Ryzen 9 5900X running llama-65b eats 40GB of RAM.

mem per token = 70897348 bytes
load time = 68146.04 ms
sample time =  1002.82 ms
predict time = 478729.38 ms / 936.85 ms per token
total time = 550394.94 ms
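As a back-of-envelope reading of those numbers (simple arithmetic on the figures above, not an additional measurement), 936.85 ms per token works out to roughly one token per second:

```bash
# Convert the predict-time line above into throughput and token count.
awk 'BEGIN {
  printf "%.2f tokens/s, ~%.0f tokens generated\n", 1000/936.85, 478729.38/936.85
}'
# prints: 1.07 tokens/s, ~511 tokens generated
```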

breakpointninja avatar Mar 12 '23 10:03 breakpointninja

AMD EPYC 7443P, 24 cores (in a VM). Prompt was "They" on seed 1678609319, as above.

| Quantized model | Threads | Memory use | Time per token |
|---|---|---|---|
| llama-7b | 4 | 4.2 GB | 156.95 ms |
| llama-7b | 6 | 4.2 GB | 113.06 ms |
| llama-7b | 8 | 4.2 GB | 93.00 ms |
| llama-7b | 10 | 4.2 GB | 85.18 ms |
| llama-7b | 12 | 4.2 GB | 77.18 ms |
| llama-7b | 21 | 4.2 GB | 76.53 ms |
| llama-7b | 24 | 4.2 GB | 85.37 ms |
| llama-65b | 4 | 41 GB | 1408.27 ms |
| llama-65b | 6 | 41 GB | 978.18 ms |
| llama-65b | 8 | 41 GB | 772.21 ms |
| llama-65b | 10 | 41 GB | 654.20 ms |
| llama-65b | 12 | 41 GB | 592.60 ms |
| llama-65b | 21 | 41 GB | 577.96 ms |
| llama-65b | 24 | 41 GB | 596.61 ms |
| llama-65b | 48 | 41 GB | 1431.73 ms |

Interestingly, it doesn't seem to scale well with core count; I guess it prefers a few fast cores and high memory bandwidth?
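A thread sweep like the one in this table can be scripted directly; the sketch below reuses the same assumed model path and flags as the earlier example and just greps the per-token timing line out of each run:

```bash
# Sweep thread counts and keep only the per-token timing from each run.
for t in 4 6 8 10 12 21 24 48; do
  echo "=== threads: $t ==="
  ./main -m ./models/65B/ggml-model-q4_0.bin -t "$t" -n 128 -s 1678609319 -p "They" 2>&1 \
    | grep "per token"
done
```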

ElRoberto538 avatar Mar 12 '23 12:03 ElRoberto538

It scales with real cores. Once you get into virtual cores (SMT threads) it starts degrading. If you have an 8-core/16-thread CPU, use 8 threads; on a 24-core/48-thread CPU, use 24 threads, and so on.

G2G2G2G avatar Mar 12 '23 12:03 G2G2G2G

I have 24 real cores, but if you look at the numbers above, it seems to hit a wall at around 12 threads and barely improves when increasing the threads to 21/24.

ElRoberto538 avatar Mar 12 '23 12:03 ElRoberto538

Interesting: with almost the same setup as the top comment (AMD 5700G with 32GB RAM, but Linux Mint) I get about 20% slower speed per token. Maybe prompt length had something to do with it, or my memory DIMMs are slower, or Arch is faster, or some combination of all of those? Not an issue, just mildly curious.

7b (6 threads): main: predict time = 35909.73 ms / 120.91 ms per token

13b (6 threads): main: predict time = 67519.31 ms / 227.34 ms per token

30b (6 threads): main: predict time = 165125.56 ms / 555.98 ms per token

pugzly avatar Mar 12 '23 22:03 pugzly

> Interesting: with almost the same setup as the top comment (AMD 5700G with 32GB RAM, but Linux Mint) I get about 20% slower speed per token. Maybe prompt length had something to do with it, or my memory DIMMs are slower, or Arch is faster, or some combination of all of those? Not an issue, just mildly curious.
>
> 7b (6 threads): main: predict time = 35909.73 ms / 120.91 ms per token
>
> 13b (6 threads): main: predict time = 67519.31 ms / 227.34 ms per token
>
> 30b (6 threads): main: predict time = 165125.56 ms / 555.98 ms per token

My assumption is memory bandwidth: my per-core speed should be slower than yours according to benchmarks, but when I run with 6 threads I get faster performance. My RAM is slow, but 8 memory channels vs 2 makes up for that, I guess.
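That bandwidth explanation fits a simple lower-bound estimate: generating one token has to stream essentially the whole quantized model through RAM, so model size divided by theoretical memory bandwidth gives a floor on the per-token time. The channel counts and DDR4-3200 figure below are assumptions about the two systems, not values reported in this thread:

```bash
# Rough bandwidth-bound floor: time_per_token >= model_bytes / memory_bandwidth.
# DDR4-3200 is ~25.6 GB/s per channel (theoretical peak).
awk 'BEGIN {
  per_chan = 25.6
  printf "5700G, 2 channels, 19 GB model: >= %.0f ms/token\n", 19 / (2 * per_chan) * 1000
  printf "EPYC,  8 channels, 41 GB model: >= %.0f ms/token\n", 41 / (8 * per_chan) * 1000
}'
# prints ~371 and ~200 ms/token; the measured times above are higher, as expected,
# since sustained bandwidth is well below the theoretical peak and compute adds
# overhead, but the trend matches the channel-count argument.
```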

ElRoberto538 avatar Mar 12 '23 22:03 ElRoberto538

Speeds on an old 4c/8t Intel i7 with the above prompt/seed:

| Model | t=4 | t=5 | t=6 | t=7 | t=8 |
|---|---|---|---|---|---|
| 7B (n=128) | 165 ms | 220 ms | 188 ms | 168 ms | 154 ms |
| 13B | 314 ms | 420 ms | 360 ms | 314 ms | 293 ms |

Interesting how the fastest runs are t=4 and t=8, with the counts in between being slower.

In comparison, I'm getting around 20-25 tokens/s (40-50 ms/token) on a 3060 Ti with the 7B model in text-generation-webui with the same prompt (although it gets much slower with larger amounts of context). If only GPUs had cheap, expandable VRAM.

plhosk avatar Mar 13 '23 00:03 plhosk

> > Interesting: with almost the same setup as the top comment (AMD 5700G with 32GB RAM, but Linux Mint) I get about 20% slower speed per token. Maybe prompt length had something to do with it, or my memory DIMMs are slower, or Arch is faster, or some combination of all of those? Not an issue, just mildly curious.
> >
> > 7b (6 threads): main: predict time = 35909.73 ms / 120.91 ms per token
> > 13b (6 threads): main: predict time = 67519.31 ms / 227.34 ms per token
> > 30b (6 threads): main: predict time = 165125.56 ms / 555.98 ms per token
>
> My assumption is memory bandwidth: my per-core speed should be slower than yours according to benchmarks, but when I run with 6 threads I get faster performance. My RAM is slow, but 8 memory channels vs 2 makes up for that, I guess.

Ah, yes. I have a Crucial 3200 MHz DDR4 kit (16GB x 2) in my system, but all this time I had it running at 2666 MHz for whatever reason. I actually didn't expect memory to be such a bottleneck for this workload; I would have blamed the CPU exclusively for every millisecond.

Now, after changing the setting in my BIOS to 3200 MHz, the numbers are still not exactly on par, but close enough:

7B: main: predict time = 31586.56 ms / 106.35 ms per token

13B: main: predict time = 59035.98 ms / 198.77 ms per token

30B: main: predict time = 139936.17 ms / 484.21 ms per token
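For anyone wondering whether their own DIMMs are running below their rated speed, the currently configured clock can be checked from Linux without rebooting into the BIOS (standard tooling, not something mentioned in this thread):

```bash
# Shows both the rated speed and the currently configured speed of each DIMM;
# requires root.
sudo dmidecode -t memory | grep -i "speed"
```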

pugzly avatar Mar 13 '23 08:03 pugzly

This might be a dumb question, but is there any way to reduce the memory requirements even if it increases inference time? Or is that fixed by the model architecture and weights?

dennislysenko avatar Mar 13 '23 16:03 dennislysenko

> This might be a dumb question, but is there any way to reduce the memory requirements even if it increases inference time?

Currently no, other than adding a lot of swap space, but even with a fast NVMe drive it will be orders of magnitude slower than running fully in memory.
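If you still want to experiment with the swap route despite the slowdown, a swap file is the quickest way to add it (standard Linux commands; the 32G size is just an example):

```bash
# Create and enable a 32 GB swap file; undo later with
# `sudo swapoff /swapfile && sudo rm /swapfile`.
sudo fallocate -l 32G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```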

plhosk avatar Mar 14 '23 02:03 plhosk

Memory/disk requirements are being added in https://github.com/ggerganov/llama.cpp/pull/269

As for the inference speed, feel free to discuss here, but I am closing this issue.

prusnak avatar Mar 18 '23 21:03 prusnak