Stephan Walter
The build failures on macOS show that I've mistakenly used AVX2 instructions in the AVX code path, so this probably won't work on an AVX-only machine without modification. Edit: removed...
> Here is a AVX2 implementation of `ggml_vec_dot_q2_0_q8_0` that operates on two blocks at a time

Thanks @slaren, I just added this. Apparently 2 bits are called a [crumb](https://mathworld.wolfram.com/Crumb.html), so...
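For anyone following along, this is roughly what that kernel computes, written as a plain scalar sketch rather than the actual AVX2 code. The struct layouts, the QK value, and the crumb-centering offset below are illustrative assumptions, not the PR's exact definitions.

```c
// Scalar sketch of a Q2 x Q8_0 dot product: each hypothetical Q2 block holds a
// scale plus 16 "crumbs" (2-bit values) packed 4 per byte; each Q8_0-style
// block holds a scale plus 16 int8 values. Layouts are assumptions for
// illustration only.
#include <stdint.h>

#define QK2 16

typedef struct {
    float   d;             // scale (stored as fp16 in ggml; plain float here for brevity)
    uint8_t qs[QK2 / 4];   // 16 crumbs packed 4 per byte
} block_q2_sketch;

typedef struct {
    float  d;              // scale
    int8_t qs[QK2];        // quantized values
} block_q8_sketch;

static float vec_dot_q2_q8_sketch(int n, const block_q2_sketch *x, const block_q8_sketch *y) {
    float sum = 0.0f;
    for (int i = 0; i < n / QK2; i++) {
        int isum = 0;
        for (int j = 0; j < QK2; j++) {
            // crumb j occupies bits 2*(j%4)..2*(j%4)+1 of byte j/4;
            // subtract 2 so the unsigned 0..3 range becomes a signed -2..1 range
            const int q2 = ((x[i].qs[j/4] >> (2*(j%4))) & 0x3) - 2;
            isum += q2 * y[i].qs[j];
        }
        sum += x[i].d * y[i].d * (float) isum;
    }
    return sum;
}
```

The AVX2 version in the PR does the same accumulation, just vectorized and over two blocks per iteration.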
> Maybe it could further improve your Q2 and Q3 if you keep the last tensor in high precision

That does help, but with a block size of 6 bytes,...
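For reference, a sketch of what "keep the last tensor in high precision" could look like on the quantization side; the function, the enum, and the name check are illustrative only, not the actual quantization loop in this PR.

```c
// Pick a higher-precision type for the output projection and the requested
// low-bit type for everything else. "output.weight" follows llama.cpp's tensor
// naming, but the rest of this helper is a hypothetical sketch.
#include <string.h>

enum sketch_type { SKETCH_Q2, SKETCH_Q4_1, SKETCH_F16 };

static enum sketch_type pick_type_sketch(const char *tensor_name, enum sketch_type requested) {
    // the final projection is small relative to the whole model,
    // but sensitive to quantization error
    if (strcmp(tensor_name, "output.weight") == 0) {
        return SKETCH_F16;  // or another higher-precision quantized type
    }
    return requested;
}
```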
Thanks to @pubby, the Q3 code is now faster on AVX2 and should be more amenable to other SIMD optimizations. You'll have to re-quantize the model, though.
At this point I'm wondering if we should target a specific model size. Is there any environment (wasm, for example) where the ~4 GB 7B Q4_0 is too large? Q2 probably...
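Rough numbers behind that question, assuming ~5 bits/weight for Q4_0 (20-byte blocks of 32 weights) and ~3 bits/weight for Q2 (6-byte blocks, assuming 16 weights per block); these are back-of-the-envelope estimates, not measured file sizes.

```c
// Back-of-the-envelope model size estimate for a ~7B-parameter model.
#include <stdio.h>

int main(void) {
    const double n_params = 7e9;
    const double q4_0_bits_per_weight = 20.0 * 8 / 32;  // ~5.0 bits
    const double q2_bits_per_weight   =  6.0 * 8 / 16;  // ~3.0 bits
    printf("Q4_0: ~%.1f GB\n", n_params * q4_0_bits_per_weight / 8 / 1e9);  // ~4.4 GB
    printf("Q2:   ~%.1f GB\n", n_params * q2_bits_per_weight   / 8 / 1e9);  // ~2.6 GB
    return 0;
}
```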
Rebased onto master, but I kept the tensor/ftype numbering, because @TheBloke has [published Alpaca/LoRA model files for Q2](https://huggingface.co/TheBloke/alpaca-lora-65B-GGML). These should still work after the rebase, but I haven't tested that. On the...
Obsolete thanks to #1684
I'm barely seeing an improvement (AVX2, 4 cores). This is about the run time of `llama_apply_lora_from_file_internal`, right? Can you show exactly which command lines you used?
Thanks @slaren. I'm seeing 17s on master and 16s with your PR. Just because the SIMD optimizations were up for discussion: with `quantize_row_q_reference` in `ggml_compute_forward_dup_f16`, the difference is greater...
On the other hand, draft pull requests skip most of the CI checks, which isn't great.