
[Q] Memory Requirements for Different Model Sizes

NightMachinery opened this issue 1 year ago • 5 comments

NightMachinery commented Mar 11 '23 12:03

7B (4-bit): 4.14 GB MEM
65B (4-bit): 38 GB MEM

satyajitghana commented Mar 11 '23 14:03

Since the original models use FP16 and llama.cpp quantizes to 4-bit, the memory requirements are roughly 4 times smaller than the original (rough arithmetic after the list):

  • 7B => ~4 GB
  • 13B => ~8 GB
  • 30B => ~16 GB
  • 65B => ~32 GB
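
The rough arithmetic (my own back-of-the-envelope numbers; 4 bits is 0.5 bytes per weight, and context/scratch buffers add some overhead on top):

  7B: 7 × 10⁹ weights × 0.5 bytes ≈ 3.5 GB of weights, hence ~4 GB in practice
  65B: 65 × 10⁹ weights × 0.5 bytes ≈ 32.5 GB of weights, hence closer to 38 GB in practice (see the measurements below)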

prusnak commented Mar 11 '23 16:03

On an M1 Max with 64 GB, using 4-bit:

65B: 38.5 GB, 850 ms per token
30B: 19.5 GB, 450 ms per token
13B: 7.8 GB, 150 ms per token
7B: 4.0 GB, 75 ms per token

cannin commented Mar 12 '23 03:03

For the record: on an Intel® Core™ i5-7600K CPU @ 3.80GHz × 4 with 16 GB RAM under Ubuntu, the 13B model runs with acceptable response times. Note that, as mentioned in previous comments, the -t 4 parameter gives the best results.

main: mem per token = 22357508 bytes
main: load time = 83076.67 ms
main: sample time = 267.12 ms
main: predict time = 193441.61 ms / 367.76 ms per token
main: total time = 277980.41 ms
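
For reference, the invocation looked something along these lines (the model path and prompt here are illustrative placeholders, not the exact command used for the numbers above; -t sets the thread count, -n the number of tokens to generate):

  ./main -m ./models/13B/ggml-model-q4_0.bin -t 4 -n 128 -p "Building a website can be done in 10 simple steps:"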

Great work!

dbddv01 commented Mar 15 '23 17:03

Should add these to the README.

ggerganov commented Mar 15 '23 20:03

@prusnak is that PC RAM or GPU VRAM?

sinanisler commented Mar 18 '23 08:03

@prusnak is that PC RAM or GPU VRAM?

llama.cpp runs on the CPU, not the GPU, so it's the PC RAM.

prusnak commented Mar 18 '23 09:03

@prusnak is that PC RAM or GPU VRAM?

llama.cpp runs on the CPU, not the GPU, so it's the PC RAM.

Is it possible that at some point we will get a video card version?

whitepapercg commented Mar 18 '23 09:03

Is it possible that at some point we will get a video card version?

I don't think so. You can run the original Whisper model on a GPU: https://github.com/openai/whisper

prusnak commented Mar 18 '23 09:03

Fwiw, running on my M2 MacBook Air with 8 GB of RAM comes to a grinding halt. On the first run the machine is completely unresponsive for about 2-3 minutes (mouse and keyboard locked), then it produces a response word every 10-20 seconds. I didn't expect great response times, but that's a bit slower than anticipated.

  • Edit: using 7B model

mrpher commented Mar 18 '23 10:03

M2 MacBook Air with 8 GB

Close every other app, and ideally reboot to a clean state. This should help. If the machine becomes unresponsive, it is swapping memory to disk. 8 GB is not that much, especially if you have browsers, Slack, etc. running.

prusnak commented Mar 18 '23 10:03

Also make sure you’re using 4 threads instead of 8 — you don’t want to be using any of the 4 efficiency cores.
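
If you'd rather not hard-code the 4, recent macOS should report the number of performance cores via sysctl (hw.perflevel0.physicalcpu is the key on Apple Silicon, if memory serves):

  sysctl -n hw.perflevel0.physicalcpu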

j-f1 commented Mar 18 '23 11:03

Working well now. Good recommendations @prusnak @j-f1, thank you!

mrpher commented Mar 18 '23 17:03

Requirements added in https://github.com/ggerganov/llama.cpp/pull/269

prusnak commented Mar 18 '23 21:03

Since the original models use FP16 and llama.cpp quantizes to 4-bit, the memory requirements are roughly 4 times smaller than the original:

  • 7B => ~4 GB
  • 13B => ~8 GB
  • 30B => ~16 GB
  • 65B => ~32 GB

32 GB is probably a little too optimistic: I have 32 GB of DDR4 clocked at 3600 MHz, and it generates a token every 2 minutes.

SpeedyCraftah commented Mar 21 '23 20:03

32 GB is probably a little too optimistic

Yeah, 38.5 GB is more realistic.

See https://github.com/ggerganov/llama.cpp#memorydisk-requirements for current values

prusnak commented Mar 21 '23 21:03

32 GB is probably a little too optimistic

Yeah, 38.5 GB is more realistic.

See https://github.com/ggerganov/llama.cpp#memorydisk-requirements for current values

I see. That makes more sense, since you mention the whole model is loaded into memory as of now. Linux would probably run better in this case, thanks to its better swap handling and lower memory usage.

Thanks!

SpeedyCraftah commented Mar 21 '23 21:03

What languages does it work with? Does it support the same input and output languages as GPT?

Yitzhokchaim commented Mar 31 '23 22:03