Georgi Gerganov

113 issues by Georgi Gerganov

ref https://twitter.com/awnihannun/status/1777072588633882741 This branch starts from the flash-attention branch (#5021, #6508). To perform a benchmark for the challenge, run:

```bash
# generate pure 4-bit model
./quantize --pure models/mistral-7b/ggml-model-f16.gguf models/mistral-7b/ggml-model-q4_0-pure.gguf q4_0
```

...

performance
review complexity : high

There is functionality around `llama_sampling_context` that is currently part of `common`. We should move it into `llama`. Pretty much the entire API from `common/sampling.h`, except `llama_sampling_params` and `llama_sampling_sample`, can be integrated into...

refactoring
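
For context, here is a rough sketch of how the `common/sampling.h` API is typically used today, to illustrate the surface that would move into `llama`. The function names come from `common/sampling.h`, but the exact signatures here are approximate and only meant as an illustration:

```cpp
#include "llama.h"
#include "sampling.h" // common/sampling.h

// approximate usage of the current common-level sampling API
// (names from common/sampling.h; exact signatures may differ)
static void generate_n(llama_context * ctx, const llama_sampling_params & sparams, int n_predict) {
    llama_sampling_context * ctx_sampling = llama_sampling_init(sparams);

    for (int i = 0; i < n_predict; ++i) {
        // pick the next token using the configured samplers / grammar
        const llama_token id = llama_sampling_sample(ctx_sampling, ctx, /*ctx_cfg =*/ nullptr);

        // update the internal state (repetition penalties, grammar, etc.)
        llama_sampling_accept(ctx_sampling, ctx, id, /*apply_grammar =*/ true);

        // ... decode `id` with llama_decode() and continue
    }

    llama_sampling_free(ctx_sampling);
}
```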

Disabled temporarily to avoid failure notifications https://github.com/ggerganov/llama.cpp/pull/6128

testing
stale

fix #6770 Setting `special == true` in `llama_token_to_piece()` will cause special/control tokens' text to be rendered in the output: https://github.com/ggerganov/llama.cpp/blob/1f45c2adc7b10637c2035e622573f1851e403979/llama.h#L827-L837
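
As an illustration, here is a minimal sketch of rendering a single token with special tokens included, assuming the signature linked above (buffer handling simplified):

```cpp
#include <string>
#include <vector>

#include "llama.h"

// render one token to text; with special == true, control tokens such as
// "<|eot_id|>" are rendered as their text instead of being skipped
static std::string token_to_piece_special(const llama_model * model, llama_token token) {
    std::vector<char> buf(256);
    const int32_t n = llama_token_to_piece(model, token, buf.data(), (int32_t) buf.size(), /*special =*/ true);
    if (n < 0) {
        return ""; // error (or buffer too small, depending on the implementation)
    }
    return std::string(buf.data(), n);
}
```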

Still not familiar with the details, but it seems it would be useful to support this architecture in `llama.cpp`. First, need to decide on the API and see what changes...

model

There have been a few reports where the grammar sampling can significantly degrade the performance. It would be nice to profile and optimize the implementation - there should be room...

performance
refactoring

ref #6849 Modelling this as a LLaMA model since it is the same arch. Model: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct

```bash
# convert to GGUF
python3 convert-hf-to-gguf.py ~/Data/huggingface/Phi-3-mini-4k-instruct/ --outfile models/phi-3-4k-instruct/ggml-model-f16.gguf --outtype f16
```

```bash
# ...
```

Currently, we always pass `b` to `ggml_mul_mat` as F32 and internally quantize it depending on the type of `a`. There is no option that allows passing an already quantized...

refactoring
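
To make the limitation concrete, here is a small sketch of the current pattern, where the activations `b` must be F32 even when `a` is quantized (the function and shape names are made up for illustration):

```c
#include "ggml.h"

// sketch: a is a quantized weight, b holds the activations - today b must be
// F32 and ggml_mul_mat() quantizes it internally based on the type of a
static struct ggml_tensor * build_mul_mat(struct ggml_context * ctx,
                                          int64_t n_embd, int64_t n_ff, int64_t n_tokens) {
    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_Q4_0, n_embd, n_ff);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32,  n_embd, n_tokens);

    // there is no way to pass b already quantized (e.g. Q8_0) and skip the
    // internal conversion - this is what the issue proposes to add
    return ggml_mul_mat(ctx, a, b);
}
```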

Currently, the padded matrix multiplications in `whisper.cpp` are silently failing with CUDA: https://github.com/ggerganov/ggml/blob/dbd02958fa4f46898f68ca29c27ddcdc58a06f98/examples/whisper/whisper.cpp#L224-L230 The reason is that the `to_fp16_cuda` and `to_fp32_cuda` calls assume no padding of the data. We can...
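
For reference, a simplified sketch of the padding pattern in question, assuming the usual `GGML_PAD` convention; the actual whisper.cpp code at the link differs in details:

```c
#include "ggml.h"

// sketch: the second dimension of the weight is padded to a multiple of the
// padding unit so that the mul_mat kernels see aligned sizes
static struct ggml_tensor * new_padded_weight(struct ggml_context * ctx,
                                              int64_t n_in, int64_t n_out) {
    const int64_t pad   = 32;                    // assumed padding unit
    const int64_t n_pad = GGML_PAD(n_out, pad);  // round n_out up to a multiple of pad

    // only the first n_out rows contain real data; a to_fp16/to_fp32 conversion
    // that assumes a dense, unpadded layout does not handle this correctly,
    // which is the failure mode described above
    return ggml_new_tensor_2d(ctx, GGML_TYPE_F16, n_in, n_pad);
}
```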

Now that distributed inference is supported thanks to the work of @evanmiller in #2099 it would be fun to try to utilize it for something cool. One such idea is...

help wanted
🦙.
hardware
research 🔬