llama.cpp
[Q] Memory Requirements for Different Model Sizes
- 7B (4-bit): 4.14 GB of memory
- 65B (4-bit): 38 GB of memory
Since the original models use FP16 and llama.cpp quantizes to 4-bit, the memory requirements are around 4 times smaller than the original (a rough estimate is sketched below the list):
- 7B => ~4 GB
- 13B => ~8 GB
- 30B => ~16 GB
- 65B => ~32 GB
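As a sanity check on that 4× figure, here is a minimal back-of-the-envelope sketch (not code from llama.cpp; it counts only the weights and ignores the KV cache, scratch buffers, and the per-block quantization scales, which is why measured usage such as 4.14 GB for 7B or 38.5 GB for 65B comes out somewhat higher):

```python
GIB = 1024 ** 3

def weight_memory_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / GIB

# Compare FP16 against 4-bit for the four LLaMA sizes.
for size in (7, 13, 30, 65):
    print(f"{size}B: FP16 ~{weight_memory_gib(size, 16):.1f} GiB, "
          f"4-bit ~{weight_memory_gib(size, 4):.1f} GiB")
```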
On an M1 Max with 64GB, 4-bit:
- 65B: 38.5 GB, 850 ms per token
- 30B: 19.5 GB, 450 ms per token
- 13B: 7.8 GB, 150 ms per token
- 7B: 4.0 GB, 75 ms per token
For the record: on an Intel® Core™ i5-7600K CPU @ 3.80GHz × 4 with 16 GB RAM under Ubuntu, the 13B model runs with acceptable response time. Note that, as mentioned in previous comments, the -t 4 parameter gives the best results.

    main: mem per token = 22357508 bytes
    main: load time = 83076.67 ms
    main: sample time = 267.12 ms
    main: predict time = 193441.61 ms / 367.76 ms per token
    main: total time = 277980.41 ms
Great work!
Should add these to the README.
@prusnak is that PC RAM or GPU VRAM?
> @prusnak is that PC RAM or GPU VRAM?
llama.cpp runs on the CPU, not the GPU, so it's the PC RAM.
Is it possible that at some point we will get a video card version?
> Is it possible that at some point we will get a video card version?
I don't think so. You can run the original Whisper model on a GPU: https://github.com/openai/whisper
FWIW, running on my M2 MacBook Air with 8 GB of RAM comes to a grinding halt. On first run, about 2-3 minutes of a completely unresponsive machine (mouse and keyboard locked), then about 10-20 seconds per response word. I didn't expect great response times, but that's a bit slower than anticipated.
- Edit: using the 7B model
> M2 MacBook Air with 8 GB
Close every other app, ideally reboot to a clean state. This should help. If you see an unresponsive machine, it is swapping memory to disk. 8 GB is not that much, especially if you have browsers, Slack, etc. running.
Also make sure you’re using 4 threads instead of 8 — you don’t want to be using any of the 4 efficiency cores.
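For example, a minimal invocation sketch (assuming the 7B model has already been converted and quantized to models/7B/ggml-model-q4_0.bin; adjust the path and prompt to your setup):

```sh
# -t 4 keeps the work on the 4 performance cores; the efficiency cores only slow things down
./main -m ./models/7B/ggml-model-q4_0.bin -t 4 -n 128 \
  -p "Building a website can be done in 10 simple steps:"
```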
Working well now, good recommendations @prusnak @j-f1 thank you
Requirements added in https://github.com/ggerganov/llama.cpp/pull/269
32 GB is probably a little too optimistic. I have 32 GB of DDR4 clocked at 3600 MHz and it takes about 2 minutes per token.
> 32 GB is probably a little too optimistic
Yeah, 38.5 GB is more realistic.
See https://github.com/ggerganov/llama.cpp#memorydisk-requirements for current values
I see. That makes more sense, since you mention the whole model is loaded into memory as of now. Linux would probably run better in this case thanks to its better swap handling and lower memory usage.
Thanks!
What languages does it work with? Does it work with the same input and output languages as GPT?