llama.cpp
It appears context memory usage can be trivially halved by using fp16?
I'm not fully familiar with this codebase, so pardon me if I'm wrong. My first attempt at modifying the code was to expand the hardcoded context window from 512 to 4096, but the additional memory usage was not pleasant.
With the context at 4096, LLaMA 7B quantized to 4 bits reports ggml ctx size = 8113.34 MB
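(If I have the 7B dimensions right, n_layer = 32 and n_embd = 4096, that jump is expected: the K and V caches scale linearly with n_ctx, so at fp32 they go from roughly 2 * 512 * 32 * 4096 * 4 bytes ≈ 512 MB at a 512-token context to about 4 GiB at 4096, which is close to half of the figure above.)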
I went into the code and changed the data type of memory_k and memory_v from GGML_TYPE_F32 to GGML_TYPE_F16.
These are the changed lines:
ctx_size += n_ctx*n_layer*n_embd*ggml_type_sizef(GGML_TYPE_F16); // memory_k
ctx_size += n_ctx*n_layer*n_embd*ggml_type_sizef(GGML_TYPE_F16); // memory_v
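To make the accounting difference concrete, here is a small standalone sketch (not part of the repository) that plugs in what I believe are the 7B dimensions, n_layer = 32 and n_embd = 4096, together with the 4096-token context; those values are my assumption, not something stated above:

#include <stdio.h>

int main(void) {
    // Assumed LLaMA 7B dimensions and expanded context (my assumption)
    const long long n_ctx = 4096, n_layer = 32, n_embd = 4096;
    const long long n_elements = n_ctx * n_layer * n_embd; // elements per KV tensor

    const double mb_f32 = n_elements * 4.0 / (1024.0 * 1024.0); // GGML_TYPE_F32: 4 bytes/element
    const double mb_f16 = n_elements * 2.0 / (1024.0 * 1024.0); // GGML_TYPE_F16: 2 bytes/element

    // memory_k and memory_v together, hence the factor of 2
    printf("KV cache at fp32: %.0f MB\n", 2.0 * mb_f32); // ~4096 MB
    printf("KV cache at fp16: %.0f MB\n", 2.0 * mb_f16); // ~2048 MB
    return 0;
}

So per tensor this works out to roughly 2 GiB at fp32 versus 1 GiB at fp16 under those assumptions.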
And these:
model.memory_k = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n_elements);
model.memory_v = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n_elements);
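As a sanity check that the element type alone drives the allocation size, here is a hedged sketch using the ggml API from this repo (ggml_init, ggml_new_tensor_1d, ggml_nbytes); it needs to be compiled and linked against ggml, and the scratch size and element count are arbitrary values I picked for illustration:

#include <stdio.h>
#include "ggml.h"

int main(void) {
    // Small scratch context just for this test (arbitrary size)
    struct ggml_init_params params = {
        .mem_size   = 64*1024*1024,
        .mem_buffer = NULL,
    };
    struct ggml_context * ctx = ggml_init(params);

    const int n_elements = 4096*32; // stand-in for n_embd*n_layer*n_ctx
    struct ggml_tensor * k32 = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_elements);
    struct ggml_tensor * k16 = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n_elements);

    // The F16 tensor should report half the bytes of the F32 one
    printf("F32: %zu bytes, F16: %zu bytes\n", ggml_nbytes(k32), ggml_nbytes(k16));

    ggml_free(ctx);
    return 0;
}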
The new memory usage reported is ggml ctx size = 6065.34 MB,
and Task Manager agrees. That's 2 GB down.
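That delta lines up with the arithmetic: dropping from 4 to 2 bytes per element saves about 4096 * 32 * 4096 * 2 bytes per tensor, i.e. 1024 MB each for memory_k and memory_v, which matches the 8113.34 − 6065.34 = 2048 MB difference reported (again assuming n_layer = 32 and n_embd = 4096 for 7B).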
So far everything is working, with no crashes and no degradation in quality. Is there any reason not to do this?