
metal : add Q2_K implementation

Open · ikawrakow opened this issue 2 years ago

27.1 ms / token on M2 Max 30-core GPU, so about the same speed as Q4_0. Memory throughput is ~156 GB/s.

Using the same access pattern as the Q2_K CUDA implementation resulted in significantly lower performance on Metal (~31 ms/token).
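For context, Q2_K packs 256 weights per super-block: four 2-bit quants per byte, plus 4-bit per-sub-block scales and mins that are in turn scaled by two fp16 super-block factors (`d` and `dmin`). A simplified CPU-side sketch of the dequantization, with `float` in place of `ggml_half` for clarity (the field names follow the llama.cpp `block_q2_K` struct, but treat this as an illustration of the layout, not the actual kernel):

```c
#include <stdint.h>

#define QK_K 256

// Simplified Q2_K super-block: 256 weights at roughly 2.56 bits/weight.
// The real llama.cpp struct stores d/dmin as ggml_half; float here for clarity.
typedef struct {
    uint8_t scales[QK_K/16]; // low 4 bits: scale, high 4 bits: min (per 16 weights)
    uint8_t qs[QK_K/4];      // 2-bit quants, 4 per byte
    float   d;               // super-block scale for the 4-bit scales
    float   dmin;            // super-block scale for the 4-bit mins
} block_q2_K_sketch;

// Dequantize one super-block: y = d*scale_j*q - dmin*min_j for each 16-weight sub-block j.
static void dequantize_q2_K_sketch(const block_q2_K_sketch *x, float *y) {
    const uint8_t *q = x->qs;
    int is = 0; // index into the per-sub-block scales
    for (int n = 0; n < QK_K; n += 128) {      // two groups of 128 weights
        for (int shift = 0; shift < 8; shift += 2) { // four 2-bit lanes per byte
            for (int half = 0; half < 2; ++half) {   // two 16-weight sub-blocks per lane
                const uint8_t sc = x->scales[is++];
                const float dl = x->d    * (sc & 0xF);
                const float ml = x->dmin * (sc >> 4);
                for (int l = 0; l < 16; ++l) {
                    *y++ = dl * ((q[16*half + l] >> shift) & 3) - ml;
                }
            }
        }
        q += 32;
    }
}
```

The interleaved layout (each byte of `qs` contributes one quant to four different 32-weight groups) is what makes the memory access pattern matter so much: a kernel has to choose between reading bytes contiguously and producing outputs contiguously, and the best choice differs between the CUDA and Metal backends.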

ikawrakow avatar Jun 08 '23 15:06 ikawrakow

@ikawrakow You can squash and merge directly after resolving the conflicts

ggerganov avatar Jun 08 '23 16:06 ggerganov

Are you planning on adding Q3_K? As I understand it, 2-bit quantization is quite a bad tradeoff compared to 3-bit.

EwoutH avatar Jun 09 '23 14:06 EwoutH

> Are you planning on adding Q3_K? As I understood, 2-bit quantization is quite a bad tradeoff compared to 3-bit.

I'm working on it. I have Q5_K done and working, but there is still a bug in Q3_K that I haven't been able to track down.

ikawrakow avatar Jun 09 '23 17:06 ikawrakow