[WIP]: Fix #286
Based on the fix for #292, here is an attempt to fix #286 (with the caveat that it has been tested only on CPU). The patch adds llama.cpp tests for quantization.
One question I'm pondering is whether we should get rid of the *_reference functions or add some more?
Thanks! Let's rebase on latest master so that the diff takes into account the new macros from #292
Btw, there has recently been an update to https://github.com/ggerganov/llama.cpp/pull/1237. We should synchronize with @sw.
> One question I'm pondering is whether we should get rid of the *_reference functions or add some more?
We need them to have a deterministic way of generating models. Add more in what sense?
> Thanks! Let's rebase on latest master so that the diff takes into account the new macros from #292
Should be done.
> Btw, there has recently been an update to ggerganov/llama.cpp#1237. We should synchronize with @sw.
Will look into this.
> One question I'm pondering is whether we should get rid of the *_reference functions or add some more?

> We need them to have a deterministic way of generating models. Add more in what sense?
We have something like

```c
.quantize_row_q           = quantize_row_q4_0,
.quantize_row_q_reference = quantize_row_q4_0_reference,
```
but the original patch included

```c
.quantize_row_q           = ggml_fp32_to_fp16_row,
.quantize_row_q_reference = ggml_fp32_to_fp16_row,
```
We could define a `ggml_fp32_to_fp16_row_reference`, too. But I also imagine that dropping the `.quantize_row_q_reference` member from the struct could work?
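To make the first option concrete, here is a minimal sketch of the wiring (my own, not from the patch: the struct, typedef, and the truncating fp32-to-fp16 stand-in are all hypothetical stand-ins for the real ggml routines). When there is no separately optimized path, both members can simply point at the same function:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t ggml_fp16_t;

/* Stand-in for ggml_fp32_to_fp16_row: a truncating (round-toward-zero)
 * fp32 -> fp16 conversion that ignores inf/NaN/subnormals. The real ggml
 * routine is more careful; this only illustrates the table wiring. */
static void fp32_to_fp16_row(const float * x, ggml_fp16_t * y, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint32_t bits;
        memcpy(&bits, &x[i], sizeof(bits));
        uint16_t sign = (uint16_t)((bits >> 16) & 0x8000u);
        int32_t  exp  = (int32_t)((bits >> 23) & 0xFFu) - 127 + 15;
        uint16_t mant = (uint16_t)((bits >> 13) & 0x3FFu);
        y[i] = (exp <= 0 || exp >= 31) ? sign
                                       : (uint16_t)(sign | (exp << 10) | mant);
    }
}

/* Hypothetical function table mirroring the PR snippet: with no
 * optimized path, both members reference the same routine. */
typedef void (*to_fp16_row_fn)(const float *, ggml_fp16_t *, size_t);

struct quantize_fns {
    to_fp16_row_fn quantize_row_q;
    to_fp16_row_fn quantize_row_q_reference;
};

static const struct quantize_fns fp16_fns = {
    .quantize_row_q           = fp32_to_fp16_row,
    .quantize_row_q_reference = fp32_to_fp16_row,
};
```

The trade-off versus dropping the member: keeping `.quantize_row_q_reference` lets the test harness treat every type uniformly, at the cost of a redundant entry for types without an optimized path.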