llama.cpp
ggml : alternative Q4_3 implementation using modified Q8_0
This one looks promising: it does not change the Q4_3 format from master and only slightly modifies Q8_0 by adding low and high sums. The results should be identical, but the Q4_3 dot product now evaluates much faster:
#define QK8_0 32

typedef struct {
    float  d;          // delta
    float  s0;         // d * sum(qs[i]) over the low  half
    float  s1;         // d * sum(qs[i]) over the high half
    int8_t qs[QK8_0];  // quants
} block_q8_0;
llama_print_timings:      sample time =   56.66 ms / 64 runs   ( 0.89 ms per run)
llama_print_timings: prompt eval time =  509.00 ms /  8 tokens (63.63 ms per token)
llama_print_timings:        eval time = 3493.43 ms / 63 runs   (55.45 ms per run)
llama_print_timings:       total time = 4069.64 ms
I think this is the way to go, but let's first see the perplexity (ppl) results from the Q4_3a approach in #1108.