llama.cpp
Q4_K implementation for Metal
Implemented mostly following the Q4_0 Metal implementation.
Slightly slower than Q4_0: on my 30-core M2 Max GPU, generating 256 tokens takes 28.1 ms/token, compared to 27.0 ms/token for Q4_0.