llama.cpp
llama : fix K-shift with quantized K (wip)
Opening this as a proof of concept of a possible solution. It should work, but it requires implementing a quant -> F32 ggml_cpy
op in the backends.
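For context, a minimal sketch of what that op amounts to at the graph level, assuming ggml's public API (ggml_view_1d, ggml_new_tensor_1d, ggml_cpy); the tensor names and sizes are illustrative, not taken from this PR:

```c
#include "ggml.h"

// Illustrative only: the missing piece is a plain ggml_cpy whose source is a
// quantized view (e.g. a Q8_0 K cache) and whose destination is F32. The
// K-shift can then be applied in F32 and the result copied back the same way.
static struct ggml_tensor * dequant_view_f32(
        struct ggml_context * ctx,
        struct ggml_tensor  * k_cache,   // quantized K cache tensor (hypothetical name)
        int64_t               n_elems) { // must be a multiple of the quant block size
    struct ggml_tensor * view = ggml_view_1d(ctx, k_cache, n_elems, 0);
    struct ggml_tensor * dst  = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_elems);
    // this quant -> F32 copy is the op each backend still needs to implement
    return ggml_cpy(ctx, view, dst);
}
```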
Yup, we should do that. Having Q -> F32 will be useful anyway. Though it's not very high-prio IMO
Thanks for looking into this. I understand that it's not a priority for the moment, so no problem.
I can confirm that this PR resolves the problem mentioned in my issue, but it throws another error in ggml_compute_forward_dup
(which is expected for now, since we still need some changes in the ggml backend).
Added cpy from F16 to Q8_0 and from Q8_0 to F16: https://github.com/ggerganov/llama.cpp/commit/3d92acfb8d41ca4d924743ffa6f7cfba105c23f5
Tested on an M2 Pro (Metal backend).
I'm not familiar with CUDA, so please check.
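For anyone porting this to CUDA, here is a scalar C reference for the two conversion directions (F16 -> Q8_0 and Q8_0 -> F16) that a kernel could be checked against. It is only a sketch of the expected semantics, assuming the block_q8_0 layout from ggml-quants.h; it is not the Metal code from the commit above.

```c
#include <math.h>
#include "ggml.h"
#include "ggml-quants.h"

// F16 -> Q8_0: per block of QK8_0 (32) values, scale d = max|x| / 127
static void f16_row_to_q8_0(const ggml_fp16_t * x, block_q8_0 * y, int64_t k) {
    for (int64_t i = 0; i < k/QK8_0; ++i) {
        float amax = 0.0f;
        float v[QK8_0];
        for (int j = 0; j < QK8_0; ++j) {
            v[j] = ggml_fp16_to_fp32(x[i*QK8_0 + j]);
            amax = fmaxf(amax, fabsf(v[j]));
        }
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f/d : 0.0f;
        y[i].d = ggml_fp32_to_fp16(d);
        for (int j = 0; j < QK8_0; ++j) {
            y[i].qs[j] = (int8_t) roundf(v[j]*id);
        }
    }
}

// Q8_0 -> F16: dequantize with the per-block scale and narrow back to half
static void q8_0_row_to_f16(const block_q8_0 * x, ggml_fp16_t * y, int64_t k) {
    for (int64_t i = 0; i < k/QK8_0; ++i) {
        const float d = ggml_fp16_to_fp32(x[i].d);
        for (int j = 0; j < QK8_0; ++j) {
            y[i*QK8_0 + j] = ggml_fp32_to_fp16(d * x[i].qs[j]);
        }
    }
}
```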
There are already dequantization kernels; it would be better to reuse them instead of duplicating the code.
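The same reuse idea can be illustrated on the CPU path (a sketch only, assuming the ggml_internal_get_type_traits helper and its to_float field as exposed by ggml.h): the dup op can look up the type's existing dequantization routine instead of adding another per-type loop.

```c
#include <assert.h>
#include "ggml.h"

// Sketch only: reuse the dequantization code ggml already ships (the same
// routines the mat-mul path uses) by going through the type traits table,
// rather than duplicating a Q8_0 -> F32 loop inside the dup op.
static void dequant_rows_via_traits(enum ggml_type type,
                                    const void * src, float * dst,
                                    int64_t nrows, int64_t ne0, size_t row_size) {
    const ggml_type_traits_t traits = ggml_internal_get_type_traits(type);
    assert(traits.to_float != NULL); // quantized and F16 types provide this
    for (int64_t r = 0; r < nrows; ++r) {
        traits.to_float((const char *) src + r*row_size, dst + r*ne0, ne0);
    }
}
```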