llama.cpp
ggml : alternative Q4_3 format + implementation
```c
#define QK4_3 32

typedef struct {
    ggml_fp16_t d0;         // delta
    ggml_fp16_t d1;         // delta
    ggml_fp16_t m;          // min
    uint8_t qs[QK4_3 / 2];  // nibbles / quants
} block_q4_3;
```
Running a perplexity test to see how much we lose from having a single min
factor in the structure instead of two
llama_print_timings: sample time = 56.68 ms / 64 runs ( 0.89 ms per run)
llama_print_timings: prompt eval time = 448.06 ms / 8 tokens ( 56.01 ms per token)
llama_print_timings: eval time = 3177.30 ms / 63 runs ( 50.43 ms per run)
llama_print_timings: total time = 3691.84 ms