llama.cpp
Use F16 for memory_k and memory_v (as suggested in #146)
As suggested in #146, we can save a lot of memory by using float16 instead of float32 for memory_k and memory_v. I implemented the suggested changes and tested with the 7B and 13B models; there were no issues on my Intel-based MacBook Pro.
Merging these changes should allow more models to run well on a wider range of hardware.
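For anyone skimming, a minimal sketch of the kind of change involved (the allocation site and tensor names follow my reading of llama_model_load at the time; treat this as illustrative, not the exact diff):

```cpp
// llama_model_load(): the KV cache ("memory") tensors.
// F32 uses 4 bytes per element; F16 uses 2, halving the cache footprint.
const int n_mem      = n_layer*n_ctx;
const int n_elements = n_embd*n_mem;

// before:
// model.memory_k = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_elements);
// model.memory_v = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_elements);

// after:
model.memory_k = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n_elements);
model.memory_v = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n_elements);
```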
Can confirm. ggml ctx size:
4529.34 MB -> 4273.34 MB
Speed stayed the same.
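For context, that delta is exactly what halving the KV cache predicts. Assuming the 7B defaults of the time (n_layer = 32, n_embd = 4096, n_ctx = 512; these values are my assumption about the test configuration): memory_k and memory_v each hold n_layer × n_ctx × n_embd = 67,108,864 elements, i.e. 256 MB each in F32 and 128 MB each in F16, so the pair shrinks from 512 MB to 256 MB, and 4529.34 − 4273.34 = 256 MB matches.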
It is hard to tell whether the quality changes, but the predictions do change (obviously).
I was worried that it might degrade quality, but as you can guess, I have no evals. I think it is best to gate this behind a command-line argument: keep F32 as the default, and switch to F16 only if the user requests it.
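A sketch of what that gate might look like, assuming a hypothetical --memory_f16 flag and a new gpt_params field (neither exists in the tree yet; all names here are illustrative):

```cpp
// utils.h: keep F32 as the default, opt in to F16 explicitly.
struct gpt_params {
    // ... existing fields ...
    bool memory_f16 = false; // hypothetical: use F16 for memory_k/memory_v
};

// utils.cpp, gpt_params_parse(): alongside the other flags.
if (arg == "--memory_f16") {
    params.memory_f16 = true;
}

// llama_model_load(): choose the KV cache element type from the flag.
const ggml_type memory_type = params.memory_f16 ? GGML_TYPE_F16 : GGML_TYPE_F32;
model.memory_k = ggml_new_tensor_1d(ctx, memory_type, n_elements);
model.memory_v = ggml_new_tensor_1d(ctx, memory_type, n_elements);
```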
I ran some more, non-scientific tests:
7B:
30B:
Both were run with -t 4 -n 2048 --repeat_penalty 1.176 --repeat_last_n 256 --temp 0.8 --top_p 0.1 -c 2048 --color -i -r "User:" -f prompts/i_example1.txt
@ty-everett, are you going to write the CLI-param conditional version? If not, I will do it.