llama.cpp
ggml : alternative Q4_3 format + implementation
```c
#define QK4_3 32

typedef struct {
    ggml_fp16_t d0;         // delta
    ggml_fp16_t d1;         // delta
    ggml_fp16_t m;          // min
    uint8_t qs[QK4_3 / 2];  // nibbles / quants
} block_q4_3;
```
Running a perplexity test to see how much we lose from having a single min
factor in the structure instead of two
llama_print_timings: sample time = 56.68 ms / 64 runs ( 0.89 ms per run)
llama_print_timings: prompt eval time = 448.06 ms / 8 tokens ( 56.01 ms per token)
llama_print_timings: eval time = 3177.30 ms / 63 runs ( 50.43 ms per run)
llama_print_timings: total time = 3691.84 ms