llama.cpp
Use F16 for memory_k and memory_v (as suggested in #146)
As suggested in #146, we can save a lot of memory by using float16 instead of float32 for memory_k and memory_v. I implemented the suggested changes and tested with the 7B and 13B models; there were no issues on my Intel-based MacBook Pro.
Merging these changes should allow more models to run well on a wider range of hardware.
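For anyone skimming, a minimal sketch of the kind of change involved (the allocation site and tensor names follow my reading of llama_model_load at the time; treat this as illustrative, not the exact diff):

```cpp
// llama_model_load(): the KV cache ("memory") tensors.
// F32 uses 4 bytes per element; F16 uses 2, halving the cache footprint.
const int n_mem      = n_layer*n_ctx;
const int n_elements = n_embd*n_mem;

// before:
// model.memory_k = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_elements);
// model.memory_v = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_elements);

// after:
model.memory_k = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n_elements);
model.memory_v = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n_elements);
```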
Can confirm. ggml ctx size:
4529.34 MB -> 4273.34 MB
Speed stayed the same.
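For context, that delta is exactly what halving the KV cache predicts. Assuming the 7B defaults of the time (n_layer = 32, n_embd = 4096, n_ctx = 512; these values are my assumption about the test configuration): memory_k and memory_v each hold n_layer × n_ctx × n_embd = 67,108,864 elements, i.e. 256 MB each in F32 and 128 MB each in F16, so the pair shrinks from 512 MB to 256 MB, and 4529.34 − 4273.34 = 256 MB matches.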
It is hard to tell whether the quality changes, but the predictions do change (obviously).
I was worried that it might degrade quality, but as you can guess, I have no evals. I think it is best to gate this behind a command-line argument: keep F32 as the default, and switch to F16 only if the user requests it.
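A sketch of what that gate might look like, assuming a hypothetical --memory_f16 flag and a new gpt_params field (neither exists in the tree yet; all names here are illustrative):

```cpp
// utils.h: keep F32 as the default, opt in to F16 explicitly.
struct gpt_params {
    // ... existing fields ...
    bool memory_f16 = false; // hypothetical: use F16 for memory_k/memory_v
};

// utils.cpp, gpt_params_parse(): alongside the other flags.
if (arg == "--memory_f16") {
    params.memory_f16 = true;
}

// llama_model_load(): choose the KV cache element type from the flag.
const ggml_type memory_type = params.memory_f16 ? GGML_TYPE_F16 : GGML_TYPE_F32;
model.memory_k = ggml_new_tensor_1d(ctx, memory_type, n_elements);
model.memory_v = ggml_new_tensor_1d(ctx, memory_type, n_elements);
```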
I ran some more, non-scientific tests:
7B:
30B:
Both were run with -t 4 -n 2048 --repeat_penalty 1.176 --repeat_last_n 256 --temp 0.8 --top_p 0.1 -c 2048 --color -i -r "User:" -f prompts/i_example1.txt
@ty-everett, are you going to write the CLI-param conditional version? If not, I will do it.