metal : add Q2_K implementation
27.1 ms / token on M2 Max 30-core GPU, so about the same speed as Q4_0. Memory throughput is ~156 GB/s.
The access pattern used in the Q2_K CUDA implementation resulted in significantly lower performance (~31 ms/token).
@ikawrakow Directly squash and merge after resolving the conflicts
Are you planning on adding Q3_K? As I understood, 2-bit quantization is quite a bad tradeoff compared to 3-bit.
Are you planning on adding Q3_K? As I understood, 2-bit quantization is quite a bad tradeoff compared to 3-bit.
I'm working on it. I have Q5_K done and working, but there is still a bug in Q3_K that I'm not able to see.