Georgi Gerganov

113 issues authored by Georgi Gerganov

Add `Q2_0` and `Q2_1` quantization support to `ggml`:
- Follow the existing `Q4_0` and `Q4_1` implementations
- Implement [reference scalar quantization and dequantization routines](https://github.com/ggerganov/llama.cpp/blob/3cd8dde0d1357b7f11bdd25c45d5bf5e97e284a0/ggml.c#L407-L449)
- I suspect we might have...

enhancement
research 🔬
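A rough idea of what such a `Q2_0` type could look like, following the `Q4_0` pattern in `ggml.c`. The names, block layout, and rounding scheme below are assumptions for illustration, not a final design:

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

#define QK 32

// hypothetical 2-bit analog of block_q4_0 (names and layout are assumptions)
typedef struct {
    float   d;          // scaling factor
    uint8_t qs[QK / 4]; // 2-bit quants, 4 values per byte
} block_q2_0;

// reference scalar quantization, mirroring the structure of the Q4_0 reference routine
static void quantize_row_q2_0_reference(const float * x, block_q2_0 * y, int k) {
    const int nb = k / QK;

    for (int i = 0; i < nb; i++) {
        float amax = 0.0f; // absolute max
        float max  = 0.0f; // signed value at the absolute max

        for (int l = 0; l < QK; l++) {
            const float v = x[i*QK + l];
            if (fabsf(v) > amax) {
                amax = fabsf(v);
                max  = v;
            }
        }

        // signed 2-bit range is [-2, 1]
        const float d  = max / -2.0f;
        const float id = d != 0.0f ? 1.0f/d : 0.0f;

        y[i].d = d;
        memset(y[i].qs, 0, sizeof(y[i].qs));

        for (int l = 0; l < QK; l++) {
            const int     vi = (int) roundf(x[i*QK + l]*id) + 2;    // map to [0, 3]
            const uint8_t q  = (uint8_t) (vi < 0 ? 0 : vi > 3 ? 3 : vi);
            y[i].qs[l/4] |= q << (2*(l % 4));
        }
    }
}
```

With 32 values per block this layout would use 4 + 8 = 12 bytes per block, i.e. an effective 3 bits per weight; dequantization would simply unpack each 2-bit value, subtract 2 and multiply back by `d`.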

See explanation here: https://github.com/ggerganov/llama.cpp/pull/439

enhancement

Currently, the `main` example has an `instruct` parameter which enables something similar to instruction-based mode. I haven't understood it completely, but this seems to be what the Alpaca models are...

enhancement
help wanted
good first issue
🦙.
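For context, the instruct mode essentially wraps the user input in an Alpaca-style template before it is tokenized and fed to the model. A minimal sketch of the idea, where the exact prefix/suffix strings are an assumption on my part (see `examples/main` for what is actually used):

```c
#include <stdio.h>
#include <string.h>

// Illustrative Alpaca-style instruction wrapping; the template strings are assumed, not
// copied from the `main` example.
static void wrap_instruction(const char * user_input, char * out, size_t out_size) {
    snprintf(out, out_size,
             "\n\n### Instruction:\n\n%s\n\n### Response:\n\n",
             user_input);
}

int main(void) {
    char prompt[1024];
    wrap_instruction("Tell me about alpacas.", prompt, sizeof(prompt));
    printf("%s", prompt); // this wrapped text is what would be tokenized and evaluated
    return 0;
}
```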

For now I just added empty README.md files:
- https://github.com/ggerganov/llama.cpp/tree/master/examples/main
- https://github.com/ggerganov/llama.cpp/tree/master/examples/quantize
- https://github.com/ggerganov/llama.cpp/tree/master/examples/perplexity
- https://github.com/ggerganov/llama.cpp/tree/master/examples/embedding
- etc.

It would be great to add usage instructions and various tips and...

documentation
help wanted
good first issue

It keeps bothering me to see these scripts in the source root. They cannot live anywhere except in the root of the repo, so it is time for them to go...

help wanted
good first issue
build

The current `Q4_0` uses a single F32 floating-point scaling factor. An idea was proposed by @ikawrakow to change this to use 2x F16 factors instead of 1x F32: https://github.com/ggerganov/llama.cpp/commit/679e1cb6c01b16abe4f3ee3c849813b98970df93 Initial...

help wanted
high priority
research 🔬
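A sketch of the proposed layout next to the current one. Treating the second F16 factor as a `Q4_1`-style offset is an assumption here, not necessarily what the linked commit does:

```c
#include <stdint.h>

typedef uint16_t ggml_fp16_t; // F16 storage type, as in ggml.h

#define QK 32

// current layout: 1x F32 scale per 32 quants -> 4 + 16 = 20 bytes per block
typedef struct {
    float   d;          // scaling factor
    uint8_t qs[QK / 2]; // 4-bit quants, 2 per byte
} block_q4_0;

// sketched alternative: 2x F16 factors packed into the same 4 bytes
// (the role of the second factor is an assumption - e.g. a minimum/offset as in Q4_1)
typedef struct {
    ggml_fp16_t d;      // scale
    ggml_fp16_t m;      // second factor
    uint8_t     qs[QK / 2];
} block_q4_0_2xf16;
```

Since 2x F16 occupies the same 4 bytes as 1x F32, the block stays at 20 bytes for 32 quants, so the change would be free in terms of model size.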

Currently, all `ggml` operations return the results in F32 format. The goal of this task is to see if there is an elegant way to add support for keeping the...

help wanted
performance
high priority
research 🔬
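A minimal sketch of where F16 support for intermediate results could plug in, assuming the destination tensor's `type` field selects the storage format. The helper below is hypothetical and not part of the ggml API:

```c
#include "ggml.h"

// Hypothetical helper: write one row of an op's result either as F32 (current behavior)
// or as F16, depending on the destination tensor type.
static void store_result_row(struct ggml_tensor * dst, int row, const float * src, int n) {
    if (dst->type == GGML_TYPE_F16) {
        ggml_fp16_t * d = (ggml_fp16_t *)((char *) dst->data + row*dst->nb[1]);
        for (int i = 0; i < n; i++) {
            d[i] = ggml_fp32_to_fp16(src[i]); // convert on store - halves memory traffic for intermediates
        }
    } else {
        float * d = (float *)((char *) dst->data + row*dst->nb[1]);
        for (int i = 0; i < n; i++) {
            d[i] = src[i];
        }
    }
}
```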

This is my understanding of how Flash Attention works based on this picture: ![image](https://user-images.githubusercontent.com/1991296/230129853-01def052-9f27-48f9-846a-4ee74103caab.png) ref: https://github.com/HazyResearch/flash-attention The implementation is here: https://github.com/ggerganov/llama.cpp/blob/flash-attn/ggml.c#L8122-L8367 I don't plan on merging this because on M1...

demo
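A scalar sketch of the core idea as I understand it: scan the KV data once per query, keeping a running max and running sum so the full attention matrix is never materialized. Shapes and argument layout below are illustrative only, not the layout used in the linked branch:

```c
#include <math.h>
#include <float.h>

// Online-softmax attention for a single query vector against n_kv key/value rows.
static void flash_attn_row(
        const float * q,   // [d]
        const float * k,   // [n_kv][d]
        const float * v,   // [n_kv][d]
        float       * o,   // [d] output
        int d, int n_kv, float scale) {
    float m = -FLT_MAX; // running max of the scores
    float s = 0.0f;     // running sum of exp(score - m)

    for (int i = 0; i < d; i++) {
        o[i] = 0.0f;
    }

    for (int j = 0; j < n_kv; j++) {
        // score for this key
        float x = 0.0f;
        for (int i = 0; i < d; i++) {
            x += q[i]*k[j*d + i];
        }
        x *= scale;

        const float m_new = x > m ? x : m;
        const float c     = expf(m - m_new); // rescale previous accumulators
        const float p     = expf(x - m_new);

        for (int i = 0; i < d; i++) {
            o[i] = o[i]*c + p*v[j*d + i];
        }

        s = s*c + p;
        m = m_new;
    }

    for (int i = 0; i < d; i++) {
        o[i] /= s; // final normalization
    }
}
```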

The last tensor of the transformer (called `output` in llama.cpp) is one of the biggest ones: https://github.com/ggerganov/llama.cpp/blob/0ad964631f9b3970f1936008fcfb1eadef59c7ed/llama.cpp#L945 I wonder how the perplexity improves by keeping it in F16 format instead...

help wanted
good first issue
high priority
generation quality
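For reference, perplexity here is the exp of the average negative log-likelihood over the evaluated tokens, which is the number that would be compared between the F32 and F16 variants of the `output` tensor. A small helper to make that concrete (hypothetical, not taken from the `perplexity` example):

```c
#include <math.h>

// Perplexity over n evaluated tokens, given the natural-log probability the model
// assigned to each correct token.
static double perplexity(const double * logprobs, int n) {
    double nll = 0.0; // negative log-likelihood
    for (int i = 0; i < n; i++) {
        nll -= logprobs[i];
    }
    return exp(nll / n);
}
```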

So, I haven't looked into the details, but I suspect there might be something wrong with the new `large` model released by OpenAI. Keep in mind this is very anecdotal evidence...

question