llama.cpp
llama : fix K-shift with quantized K (wip)
Opening this as a proof of concept of a possible solution. It should work, but it requires implementing a quant -> F32 ggml_cpy
op in the backends.
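For context, a minimal sketch of what that op amounts to at the graph level, assuming ggml's public API (ggml_view_1d, ggml_new_tensor_1d, ggml_cpy); the tensor names and sizes are illustrative, not taken from this PR:

```c
#include "ggml.h"

// Illustrative only: the missing piece is a plain ggml_cpy whose source is a
// quantized view (e.g. a Q8_0 K cache) and whose destination is F32. The
// K-shift can then be applied in F32 and the result copied back the same way.
static struct ggml_tensor * dequant_view_f32(
        struct ggml_context * ctx,
        struct ggml_tensor  * k_cache,   // quantized K cache tensor (hypothetical name)
        int64_t               n_elems) { // must be a multiple of the quant block size
    struct ggml_tensor * view = ggml_view_1d(ctx, k_cache, n_elems, 0);
    struct ggml_tensor * dst  = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_elems);
    // this quant -> F32 copy is the op each backend still needs to implement
    return ggml_cpy(ctx, view, dst);
}
```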
Yup, we should do that. Having Q -> F32 will be useful anyway. Though it's not very high-prio IMO
Thanks for looking into this. I understand that it's not a priority for the moment, so no problem.
I can confirm that this PR resolves the problem mentioned in my issue, but it throws another error in ggml_compute_forward_dup
(which is expected for now, since we still need some changes in the ggml backend).
Added cpy from F16 to Q8_0 and from Q8_0 to F16: https://github.com/ggerganov/llama.cpp/commit/3d92acfb8d41ca4d924743ffa6f7cfba105c23f5
Tested on an M2 Pro (Metal backend).
I'm not familiar with CUDA, so please check.
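For anyone porting this to CUDA, here is a scalar C reference for the two conversion directions (F16 -> Q8_0 and Q8_0 -> F16) that a kernel could be checked against. It is only a sketch of the expected semantics, assuming the block_q8_0 layout from ggml-quants.h; it is not the Metal code from the commit above.

```c
#include <math.h>
#include "ggml.h"
#include "ggml-quants.h"

// F16 -> Q8_0: per block of QK8_0 (32) values, scale d = max|x| / 127
static void f16_row_to_q8_0(const ggml_fp16_t * x, block_q8_0 * y, int64_t k) {
    for (int64_t i = 0; i < k/QK8_0; ++i) {
        float amax = 0.0f;
        float v[QK8_0];
        for (int j = 0; j < QK8_0; ++j) {
            v[j] = ggml_fp16_to_fp32(x[i*QK8_0 + j]);
            amax = fmaxf(amax, fabsf(v[j]));
        }
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f/d : 0.0f;
        y[i].d = ggml_fp32_to_fp16(d);
        for (int j = 0; j < QK8_0; ++j) {
            y[i].qs[j] = (int8_t) roundf(v[j]*id);
        }
    }
}

// Q8_0 -> F16: dequantize with the per-block scale and narrow back to half
static void q8_0_row_to_f16(const block_q8_0 * x, ggml_fp16_t * y, int64_t k) {
    for (int64_t i = 0; i < k/QK8_0; ++i) {
        const float d = ggml_fp16_to_fp32(x[i].d);
        for (int j = 0; j < QK8_0; ++j) {
            y[i*QK8_0 + j] = ggml_fp32_to_fp16(d * x[i].qs[j]);
        }
    }
}
```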
There are already dequantization kernels; it would be better to reuse them instead of duplicating the code.
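The same reuse idea can be illustrated on the CPU path (a sketch only, assuming the ggml_internal_get_type_traits helper and its to_float field as exposed by ggml.h): the dup op can look up the type's existing dequantization routine instead of adding another per-type loop.

```c
#include <assert.h>
#include "ggml.h"

// Sketch only: reuse the dequantization code ggml already ships (the same
// routines the mat-mul path uses) by going through the type traits table,
// rather than duplicating a Q8_0 -> F32 loop inside the dup op.
static void dequant_rows_via_traits(enum ggml_type type,
                                    const void * src, float * dst,
                                    int64_t nrows, int64_t ne0, size_t row_size) {
    const ggml_type_traits_t traits = ggml_internal_get_type_traits(type);
    assert(traits.to_float != NULL); // quantized and F16 types provide this
    for (int64_t r = 0; r < nrows; ++r) {
        traits.to_float((const char *) src + r*row_size, dst + r*ne0, ne0);
    }
}
```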