metal : add Q2_K implementation

Open ikawrakow opened this issue 2 years ago • 1 comments

27.1 ms / token on M2 Max 30-core GPU, so about the same speed as Q4_0. Memory throughput is ~156 GB/s.

The access pattern used in the Q2_K CUDA implementation resulted in significantly lower performance (~31 ms/token).

Jun 08 '23 15:06 ikawrakow

@ikawrakow Directly squash and merge after resolving the conflicts

Jun 08 '23 16:06 ggerganov

Are you planning on adding Q3_K? As I understood, 2-bit quantization is quite a bad tradeoff compared to 3-bit.

Jun 09 '23 14:06 EwoutH

Are you planning on adding Q3_K? As I understood, 2-bit quantization is quite a bad tradeoff compared to 3-bit.

I'm working on it. I have Q5_K done and working, but there is still a bug in Q3_K that I'm not able to see.

Jun 09 '23 17:06 ikawrakow