llama.cpp
[Q] Memory Requirements for Different Model Sizes
- 7B (4-bit): 4.14 GB of memory
- 65B (4-bit): 38 GB of memory
Since the original models use FP16 and llama.cpp quantizes to 4-bit, the memory requirements are around 4 times smaller than the original (a rough estimate is sketched below the list):
- 7B => ~4 GB
- 13B => ~8 GB
- 30B => ~16 GB
- 65B => ~32 GB
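As a sanity check on that 4× figure, here is a minimal back-of-the-envelope sketch (not code from llama.cpp; it counts only the weights and ignores the KV cache, scratch buffers, and the per-block quantization scales, which is why measured usage such as 4.14 GB for 7B or 38.5 GB for 65B comes out somewhat higher):

```python
GIB = 1024 ** 3

def weight_memory_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / GIB

# Compare FP16 against 4-bit for the four LLaMA sizes.
for size in (7, 13, 30, 65):
    print(f"{size}B: FP16 ~{weight_memory_gib(size, 16):.1f} GiB, "
          f"4-bit ~{weight_memory_gib(size, 4):.1f} GiB")
```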
On an M1 Max with 64GB, 4-bit:
- 65B: 38.5 GB, 850 ms per token
- 30B: 19.5 GB, 450 ms per token
- 13B: 7.8 GB, 150 ms per token
- 7B: 4.0 GB, 75 ms per token
For the record: on an Intel® Core™ i5-7600K CPU @ 3.80GHz × 4 with 16 GB RAM under Ubuntu, the 13B model runs with acceptable response time. Note that, as mentioned in previous comments, the -t 4 parameter gives the best results.

    main: mem per token = 22357508 bytes
    main: load time = 83076.67 ms
    main: sample time = 267.12 ms
    main: predict time = 193441.61 ms / 367.76 ms per token
    main: total time = 277980.41 ms
Great work!
Should add these to the README.
@prusnak is that PC RAM or GPU VRAM?
> @prusnak is that PC RAM or GPU VRAM?
llama.cpp runs on the CPU, not the GPU, so it's the PC RAM.
Is it possible that at some point we will get a video card version?
> Is it possible that at some point we will get a video card version?
I don't think so. You can run the original Whisper model on a GPU: https://github.com/openai/whisper
FWIW, running on my M2 MacBook Air with 8 GB of RAM comes to a grinding halt. On first run, about 2-3 minutes of a completely unresponsive machine (mouse and keyboard locked), then about 10-20 seconds per response word. I didn't expect great response times, but that's a bit slower than anticipated.
- Edit: using the 7B model
> M2 MacBook Air with 8 GB
Close every other app, ideally reboot to a clean state. This should help. If you see an unresponsive machine, it is swapping memory to disk. 8 GB is not that much, especially if you have browsers, Slack, etc. running.
Also make sure you’re using 4 threads instead of 8 — you don’t want to be using any of the 4 efficiency cores.
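For example, a minimal invocation sketch (assuming the 7B model has already been converted and quantized to models/7B/ggml-model-q4_0.bin; adjust the path and prompt to your setup):

```sh
# -t 4 keeps the work on the 4 performance cores; the efficiency cores only slow things down
./main -m ./models/7B/ggml-model-q4_0.bin -t 4 -n 128 \
  -p "Building a website can be done in 10 simple steps:"
```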
Working well now, good recommendations @prusnak @j-f1 thank you
Requirements added in https://github.com/ggerganov/llama.cpp/pull/269
32 GB is probably a little too optimistic. I have 32 GB of DDR4 clocked at 3600 MHz and it takes about 2 minutes per token.
> 32 GB is probably a little too optimistic
Yeah, 38.5 GB is more realistic.
See https://github.com/ggerganov/llama.cpp#memorydisk-requirements for current values
I see. That makes more sense, since you mention the whole model is loaded into memory as of now. Linux would probably run better in this case thanks to its better swap handling and lower memory usage.
Thanks!
What languages does it work with? Does it work with the same input and output languages as GPT?