
It appears context memory usage can be trivially halved by using fp16?

Open · jarcen opened this issue · 0 comments

I'm not fully familiar with this codebase, so pardon me if I'm wrong. My first attempt at modifying the code was to expand the hardcoded context window from 512 to 4096, but the additional memory usage was not pleasant.

LLaMA 7B quantized to 4 bits reports ggml ctx size = 8113.34 MB.

I went into the code and changed the data type of memory_k and memory_v from GGML_TYPE_F32 to GGML_TYPE_F16.

These are the changed lines:

        ctx_size += n_ctx*n_layer*n_embd*ggml_type_sizef(GGML_TYPE_F16); // memory_k
        ctx_size += n_ctx*n_layer*n_embd*ggml_type_sizef(GGML_TYPE_F16); // memory_v
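
For comparison, the corresponding lines in an unmodified checkout (as I understand it) use GGML_TYPE_F32, so each of these two terms is now half its previous size:

        ctx_size += n_ctx*n_layer*n_embd*ggml_type_sizef(GGML_TYPE_F32); // memory_k
        ctx_size += n_ctx*n_layer*n_embd*ggml_type_sizef(GGML_TYPE_F32); // memory_v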

And these:

        model.memory_k = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n_elements);
        model.memory_v = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n_elements);

The new memory usage is ggml ctx size = 6065.34 MB, and Task Manager agrees. That's 2 GB down. So far everything is working: no crashes and no degradation in quality. Is there any reason not to do this?
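
For what it's worth, the 2 GB drop matches a quick back-of-the-envelope check. This is just a minimal sketch; the LLaMA 7B hyperparameters n_layer = 32 and n_embd = 4096 are assumed here, they are not printed in the log above:

    // Minimal sketch: expected size of memory_k + memory_v in F32 vs F16.
    // n_layer and n_embd are the usual LLaMA 7B values, assumed for this estimate.
    #include <stdio.h>

    int main(void) {
        const long long n_ctx      = 4096;                  // expanded context window
        const long long n_layer    = 32;                    // assumed for 7B
        const long long n_embd     = 4096;                  // assumed for 7B
        const long long n_elements = n_ctx*n_layer*n_embd;  // elements per tensor

        double f32_mb = 2.0*n_elements*4/(1024.0*1024.0);   // k + v at 4 bytes/element
        double f16_mb = 2.0*n_elements*2/(1024.0*1024.0);   // k + v at 2 bytes/element

        printf("F32 KV memory: %.2f MB\n", f32_mb);          // 4096.00
        printf("F16 KV memory: %.2f MB\n", f16_mb);          // 2048.00
        printf("saving:        %.2f MB\n", f32_mb - f16_mb); // 2048.00
        return 0;
    }

The predicted 2048 MB saving is exactly the difference between the two reported ctx sizes (8113.34 - 6065.34), which makes sense since only the two context tensors changed type.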

jarcen · Mar 14 '23, 23:03