Results: 102 comments of Stephan Walter

The build failures on macOS show that I mistakenly used AVX2 intrinsics in the AVX part, so this probably won't work on an AVX-only machine without modification. Edit: removed...
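
For anyone running into the same thing: the fix is to keep AVX2-only intrinsics behind their own preprocessor guard so the plain-AVX build never sees them. A minimal sketch (not the actual ggml code; the function and its contents are only illustrative):

```c
#include <immintrin.h>

// Dot product sketch: AVX2/FMA instructions stay behind their own guard,
// the AVX-only path uses nothing beyond AVX, and the scalar tail handles
// the remainder (and the no-SIMD fallback).
float vec_dot_sketch(int n, const float * restrict x, const float * restrict y) {
    float sum = 0.0f;
    int i = 0;
#if defined(__AVX2__) && defined(__FMA__)
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        // fused multiply-add is only compiled under the AVX2/FMA guard
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(x + i), _mm256_loadu_ps(y + i), acc);
    }
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    sum = _mm_cvtss_f32(s);
#elif defined(__AVX__)
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        // separate multiply and add: no AVX2/FMA instructions on this path
        acc = _mm256_add_ps(acc, _mm256_mul_ps(_mm256_loadu_ps(x + i), _mm256_loadu_ps(y + i)));
    }
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    sum = _mm_cvtss_f32(s);
#endif
    for (; i < n; i++) {
        sum += x[i]*y[i];
    }
    return sum;
}
```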

> Here is a AVX2 implementation of `ggml_vec_dot_q2_0_q8_0` that operates on two blocks at a time

Thanks @slaren, I just added this. Apparently 2 bits are called a [crumb](https://mathworld.wolfram.com/Crumb.html), so...
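
Since the term came up, here is a scalar sketch of pulling the four 2-bit crumbs out of each packed byte. The block layout below (16 weights, one float scale) is only illustrative and not the exact `q2_0` format:

```c
#include <stdint.h>

// Hypothetical 2-bit block: 16 weights packed into 4 bytes plus one scale.
// Illustrates crumb extraction only; not the real q2_0 layout.
typedef struct {
    float   d;      // scale
    uint8_t qs[4];  // 16 x 2-bit quantized values ("crumbs")
} block_q2_sketch;

static void dequantize_block_sketch(const block_q2_sketch * b, float * out) {
    for (int i = 0; i < 4; i++) {
        const uint8_t byte = b->qs[i];
        for (int j = 0; j < 4; j++) {
            // shift out one 2-bit crumb at a time and re-center it around zero
            const int q = (byte >> (2*j)) & 0x3;  // 0..3
            out[4*i + j] = (q - 2) * b->d;        // -2..1 times the scale
        }
    }
}
```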

> Maybe it could further improve your Q2 and Q3 if you keep the last tensor in high precision

That does help, but with a block size of 6 bytes,...
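
To put rough numbers on that (assuming a block of 16 weights packed into 4 bytes of crumbs plus a 2-byte scale): 6 bytes per 16 weights is 6 × 8 / 16 = 3 bits per weight, whereas keeping a tensor in F16 costs 16 bits per weight for that tensor, so the trade-off only pays off if the high-precision tensor is small relative to the rest of the model.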

Thanks to @pubby, the Q3 code is now faster on AVX2 and should be more amenable to other SIMD optimizations. You'll have to re-quantize the model, though.

At this point I'm wondering if we should target a specific model size. Is there any environment (wasm, for example) where the 4 GB 7B Q4_0 is too large? Q2 probably...

Rebased onto master, but I kept the tensor/ftype numbering because @TheBloke has [published Alpaca/LoRA model files for Q2](https://huggingface.co/TheBloke/alpaca-lora-65B-GGML). These should still work, but I haven't tested that. On the...
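
For context on why the numbering was kept, a rough sketch of the idea (the values are illustrative, not the actual llama.cpp constants):

```c
// The ftype value is written into the model file, so files already published
// with a given Q2 number only stay loadable if new types are appended rather
// than existing numbers being reshuffled. Values here are illustrative.
enum ftype_sketch {
    FTYPE_F32  = 0,
    FTYPE_F16  = 1,
    FTYPE_Q4_0 = 2,
    FTYPE_Q4_1 = 3,
    // ...
    FTYPE_Q2_0 = 10, // keep whatever number the published Q2 files already use
    // any new quantization type takes the next free number
};
```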

Obsolete thanks to #1684

I'm barely seeing an improvement (AVX2, 4 cores). This is about the run time of `llama_apply_lora_from_file_internal`, right? Can you show exactly which command lines you used?

Thanks @slaren. I'm seeing 17s on master and 16s with your PR. Since the SIMD optimizations were up for discussion: with `quantize_row_q_reference` in `ggml_compute_forward_dup_f16`, the difference is greater...

On the other hand, draft pull requests skip most of the CI checks, which isn't great.