Stephan Walter
I consider it ready now, but I'm open to revising it or waiting for other changes. I agree that it should be tested on ARM and AVX512.
[Low-bit Quantization of Neural Networks for Efficient Inference](https://arxiv.org/abs/1902.06822) deals with 4-bit quantization specifically. As a smaller step, I can think of these optimizations:

* use F16 for the scaling factor....
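For illustration, a minimal sketch of what a 4-bit block with an F16 scaling factor could look like, assuming a ggml-style block layout; the struct name, `QK`, and the `ggml_fp16_t` stand-in below are placeholders, not the actual ggml definitions:

```c
#include <stdint.h>

// Stand-in for a 16-bit float storage type (ggml has its own half-precision type).
typedef uint16_t ggml_fp16_t;

#define QK 32  // number of weights sharing one scaling factor

// Hypothetical block layout: the scale is stored as F16 instead of F32,
// cutting the per-block overhead from 4 bytes to 2 for every 32 weights.
typedef struct {
    ggml_fp16_t d;          // scaling factor (F16)
    uint8_t     qs[QK / 2]; // 32 x 4-bit quantized values, two per byte
} block_q4_0_f16;
```

With a Q4_0-like layout that shrinks a block from 20 to 18 bytes, i.e. from 5.0 to 4.5 bits per weight.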
The readme still says:

> The easiest way to download the models, convert them to ggml and optimize them is with the --all-in-one command which includes the full docker image....
Probably an old hard-coded value: #142?
As the Makefile no longer sets specific instruction-set options but uses `-march=native -mtune=native`, this should no longer occur. Please reopen if you still have the problem with the latest...
Go home Q2, you're drunk ;-)

```
$ ./main -m ./models/7B/ggml-model-q2_0.bin -p "The efforts needed to add this support are so small that there is no reason not to do...
```
Updated my branch with AVX optimizations, probably far from perfect. Still quite slow...

Q2:
```
98.37 seconds per pass - ETA 17.90 hours
[1]147.6625,[2]136.8862,[3]132.6015,[4]127.8629,[5]120.4091,[6]111.7640,[7]114.2548,[8]112.8951,
```
Q3:
```
203.61 seconds per...
```
Agree; I was recently confused by the various type ids (`ggml_type` vs `model.hparams.f16`, which doesn't have an enum). Though I think for performance reasons you can't really put too much...
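To make that concrete, here is a rough sketch of an explicit mapping between the file-level `hparams.f16` integer and a `ggml_type`-style enum; the enum members and helper below are illustrative placeholders, not the real definitions in ggml.h/llama.cpp:

```c
// Illustrative only: the real ggml_type enum lives in ggml.h and may differ.
enum wtype_sketch {
    WTYPE_F32 = 0,
    WTYPE_F16,
    WTYPE_Q4_0,
    WTYPE_Q4_1,
};

// hparams.f16 is just an integer read from the model file with its own numbering;
// a small translation function makes the relationship explicit in one place.
static enum wtype_sketch wtype_from_hparams(int f16) {
    switch (f16) {
        case 0:  return WTYPE_F32;
        case 1:  return WTYPE_F16;
        case 2:  return WTYPE_Q4_0;
        case 3:  return WTYPE_Q4_1;
        default: return WTYPE_F32; // unknown value: caller should report an error
    }
}
```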
clang on macOS is apparently stricter; I'll clean this up using the warnings from the CI run. I'm not sure if the `double` precision is needed in `ggml_compute_forward_rope_f32`/`_f16`.
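For context, a stripped-down sketch of the per-pair rotation a RoPE kernel like `ggml_compute_forward_rope_f32` performs, with the angle tracked in `double`; the function name and loop structure are simplified assumptions, not the actual ggml code. The question is whether accumulating `theta` in `float` would visibly change the output.

```c
#include <math.h>

// Simplified RoPE over one row of n_dims floats at position pos.
// The real kernel walks the whole tensor; this only shows where the
// double-precision angle accumulation enters.
static void rope_row_sketch(float *x, int n_dims, int pos) {
    const double theta_scale = pow(10000.0, -2.0 / n_dims);
    double theta = (double) pos;
    for (int i = 0; i < n_dims; i += 2) {
        const float cos_theta = (float) cos(theta);
        const float sin_theta = (float) sin(theta);
        theta *= theta_scale;

        const float x0 = x[i];
        const float x1 = x[i + 1];
        x[i]     = x0*cos_theta - x1*sin_theta;
        x[i + 1] = x0*sin_theta + x1*cos_theta;
    }
}
```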
> i assume inference speed changes will be minimal, and only really a thing with simd disabled?

I believe `master` got a bit slower recently, but I can't detect a...