Georgi Gerganov
Add `Q2_0` and `Q2_1` quantization support to `ggml`:

- Follow the existing `Q4_0` and `Q4_1` implementations
- Implement [reference scalar quantization and dequantization routines](https://github.com/ggerganov/llama.cpp/blob/3cd8dde0d1357b7f11bdd25c45d5bf5e97e284a0/ggml.c#L407-L449)
- I suspect we might have...
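To make the starting point concrete, here is a rough scalar sketch of what a `Q2_0` block and its reference quantization could look like, following the linked `Q4_0` routines. The block size, bias and rounding behavior are my assumptions rather than a finalized format:

```c
#include <math.h>
#include <stdint.h>

#define QK 32

// hypothetical 2-bit block, mirroring the block_q4_0 layout:
// one F32 scaling factor per QK weights, 4 quants packed per byte
typedef struct {
    float   d;           // scaling factor (delta)
    uint8_t qs[QK / 4];  // 2-bit quants: 4 values per byte
} block_q2_0;

// reference scalar quantization, mirroring quantize_row_q4_0_reference
static void quantize_row_q2_0_reference(const float * x, block_q2_0 * y, int k) {
    const int nb = k / QK;

    for (int i = 0; i < nb; i++) {
        float amax = 0.0f; // absolute max over the block

        for (int l = 0; l < QK; l++) {
            amax = fmaxf(amax, fabsf(x[i*QK + l]));
        }

        const float d  = amax / ((1 << 1) - 1);
        const float id = d ? 1.0f/d : 0.0f;

        y[i].d = d;

        for (int l = 0; l < QK; l += 4) {
            // map each value into [0, 3] with a bias of 2 and pack 4 per byte
            uint8_t b = 0;
            for (int j = 0; j < 4; j++) {
                const float   v  = x[i*QK + l + j]*id;
                const uint8_t vi = (uint8_t)(roundf(v) + 2.0f);
                b |= (vi & 0x3) << (2*j);
            }
            y[i].qs[l/4] = b;
        }
    }
}
```

Dequantization would mirror this (`(vi - 2)*d`), and `Q2_1` would presumably add a second F32 factor for the block minimum, like `Q4_1` does.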
See explanation here: https://github.com/ggerganov/llama.cpp/pull/439
Currently, the `main` example has an `instruct` parameter which enables something similar to instruction-based mode. I haven't fully understood it yet, but this seems to be what the Alpaca models are...
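For context, the Alpaca models are fine-tuned on a fixed prompt template, so an instruction mode would presumably wrap the user input in something like the following (a sketch of the Stanford Alpaca no-input template; how `main` handles the exact whitespace is not something I have verified):

```c
// Stanford Alpaca prompt template (no-input variant); the user's
// instruction would be substituted for %s before evaluation
static const char * alpaca_prompt_fmt =
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
    "\n"
    "### Instruction:\n"
    "%s\n"
    "\n"
    "### Response:\n";
```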
For now I just added empty README.md files:

- https://github.com/ggerganov/llama.cpp/tree/master/examples/main
- https://github.com/ggerganov/llama.cpp/tree/master/examples/quantize
- https://github.com/ggerganov/llama.cpp/tree/master/examples/perplexity
- https://github.com/ggerganov/llama.cpp/tree/master/examples/embedding
- etc.

It would be great to add usage instructions and various tips and...
It keeps bothering me to see these scripts in the source root. They cannot live anywhere except in the root of the repo, so it is time for them to go...
The current `Q4_0` uses a single F32 scaling factor per block. @ikawrakow proposed changing this to use 2x F16 factors instead of 1x F32: https://github.com/ggerganov/llama.cpp/commit/679e1cb6c01b16abe4f3ee3c849813b98970df93 Initial...
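Since 2x F16 occupies the same 4 bytes as 1x F32, the per-block size would not change. For comparison, a layout sketch (the name of the second struct and what exactly the two factors encode are my assumptions, not taken from the linked commit):

```c
#include <stdint.h>
#include "ggml.h" // for ggml_fp16_t

#define QK 32

// current layout: one F32 scaling factor per block of QK weights
typedef struct {
    float   d;
    uint8_t qs[QK / 2];
} block_q4_0;           // 4 + 16 = 20 bytes per 32 weights

// proposed layout: two F16 factors in the same 4 bytes of header
// (whether they are two per-half scales or a scale + minimum is an
//  assumption here, not confirmed from the commit)
typedef struct {
    ggml_fp16_t d0;
    ggml_fp16_t d1;
    uint8_t     qs[QK / 2];
} block_q4_0_f16x2;     // still 20 bytes per 32 weights
```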
Currently, all `ggml` operations return the results in F32 format. The goal of this task is to see if there is an elegant way to add support for keeping the...
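A minimal sketch of the idea at the point where a kernel writes its results: keep accumulating in F32, but convert on store when the destination tensor was created as F16. The helper below is illustrative and not existing `ggml` code; `ggml_fp32_to_fp16()` and the tensor fields are the existing API:

```c
#include "ggml.h"

// store one row of F32 results into dst, converting to F16 if requested
// (illustrative helper -- the dispatch on dst->type is the proposal)
static void store_row_f32_or_f16(struct ggml_tensor * dst, int64_t i1, const float * src, int64_t n) {
    if (dst->type == GGML_TYPE_F32) {
        float * out = (float *) ((char *) dst->data + i1*dst->nb[1]);
        for (int64_t j = 0; j < n; j++) {
            out[j] = src[j]; // current behavior: results stay in F32
        }
    } else if (dst->type == GGML_TYPE_F16) {
        ggml_fp16_t * out = (ggml_fp16_t *) ((char *) dst->data + i1*dst->nb[1]);
        for (int64_t j = 0; j < n; j++) {
            out[j] = ggml_fp32_to_fp16(src[j]); // accumulate in F32, store as F16
        }
    }
}
```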
This is my understanding of how Flash Attention works, based on the picture from the Flash Attention repo (ref: https://github.com/HazyResearch/flash-attention). The implementation is here: https://github.com/ggerganov/llama.cpp/blob/flash-attn/ggml.c#L8122-L8367 I don't plan on merging this because on M1...
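For reference, here is a minimal scalar sketch of the online-softmax recurrence that Flash Attention is built around, for a single query and without any tiling, blocking or SIMD. It is illustrative only and not the code from the linked branch:

```c
#include <math.h>

// attention for a single query q[d] over n keys/values, streaming over the
// keys with a running max and running sum -- the n x n score matrix is
// never materialized
static void flash_attn_row(
        const float * q,    // [d]
        const float * k,    // [n, d]
        const float * v,    // [n, d]
        float       * o,    // [d] output
        int n, int d, float scale) {
    float m = -INFINITY; // running max of the scores
    float l = 0.0f;      // running sum of exp(score - m)

    for (int j = 0; j < d; j++) {
        o[j] = 0.0f;
    }

    for (int i = 0; i < n; i++) {
        // scaled dot-product score for key i
        float s = 0.0f;
        for (int j = 0; j < d; j++) {
            s += q[j]*k[i*d + j];
        }
        s *= scale;

        const float m_new = fmaxf(m, s);
        const float alpha = expf(m - m_new); // rescales the old accumulator
        const float p     = expf(s - m_new);

        l = l*alpha + p;
        for (int j = 0; j < d; j++) {
            o[j] = o[j]*alpha + p*v[i*d + j];
        }
        m = m_new;
    }

    // final normalization by the softmax denominator
    for (int j = 0; j < d; j++) {
        o[j] /= l;
    }
}
```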
The last tensor of the transformer (called `output` in llama.cpp) is one of the biggest ones: https://github.com/ggerganov/llama.cpp/blob/0ad964631f9b3970f1936008fcfb1eadef59c7ed/llama.cpp#L945 I wonder how the perplexity improves by keeping it in F16 format instead...
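One way to run the experiment would be to special-case this tensor when choosing the quantization target type, along these lines (the helper and the serialized tensor name are assumptions for illustration, not existing llama.cpp code):

```c
#include <string.h>
#include "ggml.h"

// hypothetical per-tensor type selection during quantization:
// keep the output projection in F16 and quantize everything else
// (the serialized name "output.weight" is an assumption)
static enum ggml_type choose_quant_type(const char * name, enum ggml_type quantized_type) {
    if (strcmp(name, "output.weight") == 0) {
        return GGML_TYPE_F16;
    }
    return quantized_type;
}
```

Comparing perplexity with and without this exception would then show how much this particular tensor contributes.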
So, I haven't looked into the details, but I suspect there might be something wrong with the new `large` model released by OpenAI. Keep in mind this is very anecdotal evidence...