ggml : alternative Q4_3 implementation using modified Q8_0

Open ggerganov opened this issue 1 year ago • 0 comments

This one looks promising: it does not change the Q4_3 format from master and only slightly modifies Q8_0 by adding low and high sums (`s0` and `s1`). The results should be identical, but the Q4_3 dot product now evaluates much faster:

#define QK8_0 32
typedef struct {
    float   d;          // delta
    float   s0;         // d * sum(qs[i]) low
    float   s1;         // d * sum(qs[i]) high
    int8_t  qs[QK8_0];  // quants
} block_q8_0;
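
To illustrate why precomputed sums can speed up the dot product, here is a minimal sketch, not the actual ggml kernel. It assumes a Q4_3 block stores 16 values as `x[i] ≈ d * q[i] + m`, so one Q8_0 block spans two Q4_3 blocks and `s0`/`s1` cover its low and high halves. Under that assumption the offset term `m * d8 * sum(q8)` reduces to `m * s0` (or `m * s1`), leaving only an integer dot product in the inner loop. The names `block_q4_3_sketch`, `quantize_q8_0_sketch`, `dot_q4_3_q8_0_sketch`, `dot_reference_sketch`, the nibble layout, and the use of `float` instead of fp16 are all hypothetical simplifications.

```c
#include <stdint.h>

#define QK8_0 32
#define QK4_3 16  /* assumption: two Q4_3 blocks per Q8_0 block */

typedef struct {
    float   d;          /* delta */
    float   s0;         /* d * sum(qs[i]) over the low half  */
    float   s1;         /* d * sum(qs[i]) over the high half */
    int8_t  qs[QK8_0];  /* quants */
} block_q8_0;

/* Hypothetical simplified Q4_3 block: x[i] ~= d * q[i] + m, q in [0, 15]. */
typedef struct {
    float   d;              /* delta (float here for simplicity) */
    float   m;              /* offset */
    uint8_t qs[QK4_3 / 2];  /* two 4-bit quants per byte, low nibble first (assumed) */
} block_q4_3_sketch;

/* Quantize 32 floats to Q8_0 and record the scaled half-block sums. */
static void quantize_q8_0_sketch(const float *x, block_q8_0 *y) {
    float amax = 0.0f;
    for (int i = 0; i < QK8_0; i++) {
        const float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax) amax = a;
    }
    const float d  = amax / 127.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    int sum0 = 0, sum1 = 0;
    for (int i = 0; i < QK8_0; i++) {
        const float v = x[i] * id;
        const int   q = (int)(v + (v >= 0.0f ? 0.5f : -0.5f)); /* round to nearest */
        y->qs[i] = (int8_t)q;
        if (i < QK8_0 / 2) sum0 += q; else sum1 += q;
    }
    y->d  = d;
    y->s0 = d * (float)sum0;
    y->s1 = d * (float)sum1;
}

/* Dot product using the sums: the offset m costs one multiply per half. */
static float dot_q4_3_q8_0_sketch(const block_q4_3_sketch x[2], const block_q8_0 *y) {
    float acc = 0.0f;
    for (int h = 0; h < 2; h++) {
        const int8_t *q8 = y->qs + h * QK4_3;
        int isum = 0;
        for (int j = 0; j < QK4_3 / 2; j++) {
            isum += (x[h].qs[j] & 0x0F) * q8[2 * j];
            isum += (x[h].qs[j] >> 4)   * q8[2 * j + 1];
        }
        const float s = (h == 0) ? y->s0 : y->s1;
        acc += x[h].d * y->d * (float)isum + x[h].m * s;
    }
    return acc;
}

/* Reference: dequantize both sides element by element, then multiply. */
static float dot_reference_sketch(const block_q4_3_sketch x[2], const block_q8_0 *y) {
    float acc = 0.0f;
    for (int i = 0; i < QK8_0; i++) {
        const block_q4_3_sketch *b = &x[i / QK4_3];
        const int j  = i % QK4_3;
        const int q4 = (j % 2 == 0) ? (b->qs[j / 2] & 0x0F) : (b->qs[j / 2] >> 4);
        acc += (b->d * (float)q4 + b->m) * (y->d * (float)y->qs[i]);
    }
    return acc;
}
```

Both functions should agree up to floating-point rounding, since the sketch only regroups the same terms; the real kernels would additionally use SIMD and fp16 scales.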

llama_print_timings:      sample time =    56.66 ms /    64 runs   (    0.89 ms per run)
llama_print_timings: prompt eval time =   509.00 ms /     8 tokens (   63.63 ms per token)
llama_print_timings:        eval time =  3493.43 ms /    63 runs   (   55.45 ms per run)
llama_print_timings:       total time =  4069.64 ms

I think this is the way to go, but let's see the ppl results from the Q4_3a #1108 approach first.

Apr 21 '23 17:04 ggerganov
