Georgi Gerganov issues

Results 113 issues of


Georgi Gerganov

New Q4_0 implementation using 2x F16 instead of 1x F32

ref #959 ### ARM NEON only implementation #### Timing Time per token `~55 ms` Up from `~50 ms` on `Q4_0` `master` #### Perplexity ##### Without `BLAS` 25 iters: 6.5251 ```...

generation quality

ggml : test dot product q4_0 x f32

Plugged @ikawrakow's idea from #1041 On `master`, I get ~51 ms / token: ```java $ make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -c...

Multi-thread the Q8_0 quantization in ggml_compute_forward_mul_mat_q_f32()

This part takes about 10% of the total inference time for 7B and it is currently single-threaded: https://github.com/ggerganov/llama.cpp/blob/6a9661ea5ad72166b700ae5e87976e4452499dda/ggml.c#L7877-L7884 Try to multi-thread this by splitting the work across rows. Since the...

enhancement

good first issue

performance

ggml : alternative Q4_3 implementation using modified Q8_0

This one looks promising - it does not change the `Q4_3` format from `master` and only modifies slightly `Q8_0` by adding low and high sums. The results should be identical,...

ggml : alternative Q4_3 format + implementation

```c #define QK4_3 32 typedef struct { ggml_fp16_t d0; // delta ggml_fp16_t d1; // delta ggml_fp16_t m; // min uint8_t qs[QK4_3 / 2]; // nibbles / quants } block_q4_3; ```...

llama : quantize attention results

ref #1098 Here we re-quantize the F16 intermediate results in the attention layer. This way, *all* matrix multiplication in the transformer become quantized. Putting this here just for reference. Haven't...

demo

Try to use quantized `ggml_mul_mat` in attention layer

The following 2 matrix multiplication calls sill remain in FP16 precission: - https://github.com/ggerganov/llama.cpp/blob/d40fded93e1a533e969768e1e335c15c61c296ce/llama.cpp#L1135-L1137 - https://github.com/ggerganov/llama.cpp/blob/d40fded93e1a533e969768e1e335c15c61c296ce/llama.cpp#L1158-L1160 Was wondering, if we quantize those on-the-fly would there be any benefit. The quantization can...

good first issue

performance

Combine large LLM with small LLM for faster inference

So I was thinking about the following idea. It is probably completely bogus, but I would definitely investigate it when and if I had the time to, so maybe someone...

question

research 🔬

llama : refactor get / set state + remove redundant kv cache API

- Normalize the code style - Move the definitions at the correct place in `llama.cpp` - Retire `llama_get_kv_cache()`, `llama_get_kv_cache_size()` and `llama_set_kv_cache()` Not sure how to test this - maybe we...

refactoring

ggml : export symbols

This allows to build `ggml` as a shared library Running CI to make sure this does not break something