Stephan Walter
The build failures on macOS show that I've mistakenly used AVX2 instructions in the AVX code path, so this probably won't work on an AVX-only machine without modification. Edit: removed...
> Here is a AVX2 implementation of `ggml_vec_dot_q2_0_q8_0` that operates on two blocks at a time

Thanks @slaren, I just added this. Apparently 2 bits are called a [crumb](https://mathworld.wolfram.com/Crumb.html), so...
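For anyone following along, this is roughly what that kernel computes, written as a plain scalar sketch rather than the actual AVX2 code. The struct layouts, the QK value, and the crumb-centering offset below are illustrative assumptions, not the PR's exact definitions.

```c
// Scalar sketch of a Q2 x Q8_0 dot product: each hypothetical Q2 block holds a
// scale plus 16 "crumbs" (2-bit values) packed 4 per byte; each Q8_0-style
// block holds a scale plus 16 int8 values. Layouts are assumptions for
// illustration only.
#include <stdint.h>

#define QK2 16

typedef struct {
    float   d;             // scale (stored as fp16 in ggml; plain float here for brevity)
    uint8_t qs[QK2 / 4];   // 16 crumbs packed 4 per byte
} block_q2_sketch;

typedef struct {
    float  d;              // scale
    int8_t qs[QK2];        // quantized values
} block_q8_sketch;

static float vec_dot_q2_q8_sketch(int n, const block_q2_sketch *x, const block_q8_sketch *y) {
    float sum = 0.0f;
    for (int i = 0; i < n / QK2; i++) {
        int isum = 0;
        for (int j = 0; j < QK2; j++) {
            // crumb j occupies bits 2*(j%4)..2*(j%4)+1 of byte j/4;
            // subtract 2 so the unsigned 0..3 range becomes a signed -2..1 range
            const int q2 = ((x[i].qs[j/4] >> (2*(j%4))) & 0x3) - 2;
            isum += q2 * y[i].qs[j];
        }
        sum += x[i].d * y[i].d * (float) isum;
    }
    return sum;
}
```

The AVX2 version in the PR does the same accumulation, just vectorized and over two blocks per iteration.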
> Maybe it could further improve your Q2 and Q3 if you keep the last tensor in high precision

That does help, but with a block size of 6 bytes,...
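For reference, a sketch of what "keep the last tensor in high precision" could look like on the quantization side; the function, the enum, and the name check are illustrative only, not the actual quantization loop in this PR.

```c
// Pick a higher-precision type for the output projection and the requested
// low-bit type for everything else. "output.weight" follows llama.cpp's tensor
// naming, but the rest of this helper is a hypothetical sketch.
#include <string.h>

enum sketch_type { SKETCH_Q2, SKETCH_Q4_1, SKETCH_F16 };

static enum sketch_type pick_type_sketch(const char *tensor_name, enum sketch_type requested) {
    // the final projection is small relative to the whole model,
    // but sensitive to quantization error
    if (strcmp(tensor_name, "output.weight") == 0) {
        return SKETCH_F16;  // or another higher-precision quantized type
    }
    return requested;
}
```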
Thanks to @pubby, the Q3 code is now faster on AVX2 and should be more amenable to other SIMD optimizations. You'll have to re-quantize the model, though.
At this point I'm wondering if we should target a specific model size. Is there any environment (wasm, for example) where the ~4 GB 7B Q4_0 is too large? Q2 probably...
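Rough numbers behind that question, assuming ~5 bits/weight for Q4_0 (20-byte blocks of 32 weights) and ~3 bits/weight for Q2 (6-byte blocks, assuming 16 weights per block); these are back-of-the-envelope estimates, not measured file sizes.

```c
// Back-of-the-envelope model size estimate for a ~7B-parameter model.
#include <stdio.h>

int main(void) {
    const double n_params = 7e9;
    const double q4_0_bits_per_weight = 20.0 * 8 / 32;  // ~5.0 bits
    const double q2_bits_per_weight   =  6.0 * 8 / 16;  // ~3.0 bits
    printf("Q4_0: ~%.1f GB\n", n_params * q4_0_bits_per_weight / 8 / 1e9);  // ~4.4 GB
    printf("Q2:   ~%.1f GB\n", n_params * q2_bits_per_weight   / 8 / 1e9);  // ~2.6 GB
    return 0;
}
```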
Rebased onto master, but I kept the tensor/ftype numbering, because @TheBloke has [published Alpaca/LoRA model files for Q2](https://huggingface.co/TheBloke/alpaca-lora-65B-GGML). These should still work after the rebase, but I haven't tested that. On the...
Obsolete thanks to #1684
I'm barely seeing an improvement (AVX2, 4 cores). This is about the run time of `llama_apply_lora_from_file_internal`, right? Can you show exactly which command lines you used?
Thanks @slaren. I'm seeing 17s on master and 16s with your PR. Just because the SIMD optimizations were up for discussion: with `quantize_row_q_reference` in `ggml_compute_forward_dup_f16`, the difference is greater...
On the other hand, draft pull requests skip most of the CI checks, which isn't great.