Georgi Gerganov

113 issues authored by Georgi Gerganov

Add `Q2_0` and `Q2_1` quantization support to `ggml`:
- Follow the existing `Q4_0` and `Q4_1` implementations
- Implement [reference scalar quantization and dequantization routines](https://github.com/ggerganov/llama.cpp/blob/3cd8dde0d1357b7f11bdd25c45d5bf5e97e284a0/ggml.c#L407-L449)
- I suspect we might have...

enhancement
research 🔬
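A rough idea of what such a `Q2_0` type could look like, following the `Q4_0` pattern in `ggml.c`. The names, block layout, and rounding scheme below are assumptions for illustration, not a final design:

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

#define QK 32

// hypothetical 2-bit analog of block_q4_0 (names and layout are assumptions)
typedef struct {
    float   d;          // scaling factor
    uint8_t qs[QK / 4]; // 2-bit quants, 4 values per byte
} block_q2_0;

// reference scalar quantization, mirroring the structure of the Q4_0 reference routine
static void quantize_row_q2_0_reference(const float * x, block_q2_0 * y, int k) {
    const int nb = k / QK;

    for (int i = 0; i < nb; i++) {
        float amax = 0.0f; // absolute max
        float max  = 0.0f; // signed value at the absolute max

        for (int l = 0; l < QK; l++) {
            const float v = x[i*QK + l];
            if (fabsf(v) > amax) {
                amax = fabsf(v);
                max  = v;
            }
        }

        // signed 2-bit range is [-2, 1]
        const float d  = max / -2.0f;
        const float id = d != 0.0f ? 1.0f/d : 0.0f;

        y[i].d = d;
        memset(y[i].qs, 0, sizeof(y[i].qs));

        for (int l = 0; l < QK; l++) {
            const int     vi = (int) roundf(x[i*QK + l]*id) + 2;    // map to [0, 3]
            const uint8_t q  = (uint8_t) (vi < 0 ? 0 : vi > 3 ? 3 : vi);
            y[i].qs[l/4] |= q << (2*(l % 4));
        }
    }
}
```

With 32 values per block this layout would use 4 + 8 = 12 bytes per block, i.e. an effective 3 bits per weight; dequantization would simply unpack each 2-bit value, subtract 2 and multiply back by `d`.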

See explanation here: https://github.com/ggerganov/llama.cpp/pull/439

enhancement

Currently, the `main` example has an `instruct` parameter which enables something similar to instruction-based mode. I haven't understood it completely, but this seems to be what the Alpaca models are...

enhancement
help wanted
good first issue
🦙.
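For context, the instruct mode essentially wraps the user input in an Alpaca-style template before it is tokenized and fed to the model. A minimal sketch of the idea, where the exact prefix/suffix strings are an assumption on my part (see `examples/main` for what is actually used):

```c
#include <stdio.h>
#include <string.h>

// Illustrative Alpaca-style instruction wrapping; the template strings are assumed, not
// copied from the `main` example.
static void wrap_instruction(const char * user_input, char * out, size_t out_size) {
    snprintf(out, out_size,
             "\n\n### Instruction:\n\n%s\n\n### Response:\n\n",
             user_input);
}

int main(void) {
    char prompt[1024];
    wrap_instruction("Tell me about alpacas.", prompt, sizeof(prompt));
    printf("%s", prompt); // this wrapped text is what would be tokenized and evaluated
    return 0;
}
```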

For now I just added empty README.md files:
- https://github.com/ggerganov/llama.cpp/tree/master/examples/main
- https://github.com/ggerganov/llama.cpp/tree/master/examples/quantize
- https://github.com/ggerganov/llama.cpp/tree/master/examples/perplexity
- https://github.com/ggerganov/llama.cpp/tree/master/examples/embedding
- etc.

It would be great to add usage instructions and various tips and...

documentation
help wanted
good first issue

It keeps bothering me to see these scripts in the source root. They cannot live anywhere except in the root of the repo, so it is time for them to go...

help wanted
good first issue
build

The current `Q4_0` uses a single F32 floating-point scaling factor. An idea was proposed by @ikawrakow to change this to use 2x F16 factors instead of 1x F32: https://github.com/ggerganov/llama.cpp/commit/679e1cb6c01b16abe4f3ee3c849813b98970df93 Initial...

help wanted
high priority
research 🔬
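A sketch of the proposed layout next to the current one. Treating the second F16 factor as a `Q4_1`-style offset is an assumption here, not necessarily what the linked commit does:

```c
#include <stdint.h>

typedef uint16_t ggml_fp16_t; // F16 storage type, as in ggml.h

#define QK 32

// current layout: 1x F32 scale per 32 quants -> 4 + 16 = 20 bytes per block
typedef struct {
    float   d;          // scaling factor
    uint8_t qs[QK / 2]; // 4-bit quants, 2 per byte
} block_q4_0;

// sketched alternative: 2x F16 factors packed into the same 4 bytes
// (the role of the second factor is an assumption - e.g. a minimum/offset as in Q4_1)
typedef struct {
    ggml_fp16_t d;      // scale
    ggml_fp16_t m;      // second factor
    uint8_t     qs[QK / 2];
} block_q4_0_2xf16;
```

Since 2x F16 occupies the same 4 bytes as 1x F32, the block stays at 20 bytes for 32 quants, so the change would be free in terms of model size.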

Currently, all `ggml` operations return the results in F32 format. The goal of this task is to see if there is an elegant way to add support for keeping the...

help wanted
performance
high priority
research 🔬
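A minimal sketch of where F16 support for intermediate results could plug in, assuming the destination tensor's `type` field selects the storage format. The helper below is hypothetical and not part of the ggml API:

```c
#include "ggml.h"

// Hypothetical helper: write one row of an op's result either as F32 (current behavior)
// or as F16, depending on the destination tensor type.
static void store_result_row(struct ggml_tensor * dst, int row, const float * src, int n) {
    if (dst->type == GGML_TYPE_F16) {
        ggml_fp16_t * d = (ggml_fp16_t *)((char *) dst->data + row*dst->nb[1]);
        for (int i = 0; i < n; i++) {
            d[i] = ggml_fp32_to_fp16(src[i]); // convert on store - halves memory traffic for intermediates
        }
    } else {
        float * d = (float *)((char *) dst->data + row*dst->nb[1]);
        for (int i = 0; i < n; i++) {
            d[i] = src[i];
        }
    }
}
```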

This is my understanding of how Flash Attention works based on this picture: ![image](https://user-images.githubusercontent.com/1991296/230129853-01def052-9f27-48f9-846a-4ee74103caab.png) ref: https://github.com/HazyResearch/flash-attention The implementation is here: https://github.com/ggerganov/llama.cpp/blob/flash-attn/ggml.c#L8122-L8367 I don't plan on merging this because on M1...

demo
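A scalar sketch of the core idea as I understand it: scan the KV data once per query, keeping a running max and running sum so the full attention matrix is never materialized. Shapes and argument layout below are illustrative only, not the layout used in the linked branch:

```c
#include <math.h>
#include <float.h>

// Online-softmax attention for a single query vector against n_kv key/value rows.
static void flash_attn_row(
        const float * q,   // [d]
        const float * k,   // [n_kv][d]
        const float * v,   // [n_kv][d]
        float       * o,   // [d] output
        int d, int n_kv, float scale) {
    float m = -FLT_MAX; // running max of the scores
    float s = 0.0f;     // running sum of exp(score - m)

    for (int i = 0; i < d; i++) {
        o[i] = 0.0f;
    }

    for (int j = 0; j < n_kv; j++) {
        // score for this key
        float x = 0.0f;
        for (int i = 0; i < d; i++) {
            x += q[i]*k[j*d + i];
        }
        x *= scale;

        const float m_new = x > m ? x : m;
        const float c     = expf(m - m_new); // rescale previous accumulators
        const float p     = expf(x - m_new);

        for (int i = 0; i < d; i++) {
            o[i] = o[i]*c + p*v[j*d + i];
        }

        s = s*c + p;
        m = m_new;
    }

    for (int i = 0; i < d; i++) {
        o[i] /= s; // final normalization
    }
}
```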

The last tensor of the transformer (called `output` in llama.cpp) is one of the biggest ones: https://github.com/ggerganov/llama.cpp/blob/0ad964631f9b3970f1936008fcfb1eadef59c7ed/llama.cpp#L945 I wonder how the perplexity improves by keeping it in F16 format instead...

help wanted
good first issue
high priority
generation quality
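For reference, perplexity here is the exp of the average negative log-likelihood over the evaluated tokens, which is the number that would be compared between the F32 and F16 variants of the `output` tensor. A small helper to make that concrete (hypothetical, not taken from the `perplexity` example):

```c
#include <math.h>

// Perplexity over n evaluated tokens, given the natural-log probability the model
// assigned to each correct token.
static double perplexity(const double * logprobs, int n) {
    double nll = 0.0; // negative log-likelihood
    for (int i = 0; i < n; i++) {
        nll -= logprobs[i];
    }
    return exp(nll / n);
}
```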

So, I haven't looked into the details, but I suspect there might be something wrong with the new `large` model released by OpenAI. Keep in mind this is very anecdotal evidence...

question