Georgi Gerganov
### Overview

Attempting to make two separate classes for the 2 types of KV cache:

- `llama_kv_cache_unified : llama_kv_cache`
- `llama_kv_cache_recurrent : llama_kv_cache`

```mermaid
graph TD;
  llama_memory_i --> llama_kv_cache
  llama_kv_cache...
```
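A minimal C++ sketch of the hierarchy described above. Only the class names and the base/derived relationships come from the description; the `clear()` method and the comments are illustrative assumptions, not the actual llama.cpp interface:

```cpp
// Sketch of the proposed split (illustrative only; the virtual method
// shown here is an assumption, not the real llama.cpp interface).
struct llama_memory_i {
    virtual ~llama_memory_i() = default;
    virtual void clear() = 0; // assumed common memory operation
};

struct llama_kv_cache : llama_memory_i {
    // shared KV-cache state/operations would live here
};

// standard attention KV cache (one cell per token position)
struct llama_kv_cache_unified : llama_kv_cache {
    void clear() override { /* reset all cells */ }
};

// cache for recurrent models (per-sequence recurrent state)
struct llama_kv_cache_recurrent : llama_kv_cache {
    void clear() override { /* reset per-sequence recurrent state */ }
};
```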
target #12799

There is no need to create a KV cache when using embeddings-only models such as BERT.
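As a rough sketch of what "no KV cache for embeddings-only models" could look like: allocate the cache only when the model actually runs causal attention. The `causal_attn` flag and the `create_kv_cache` helper below are hypothetical names for illustration, not taken from the PR:

```cpp
// Hypothetical sketch: only allocate a KV cache when the model performs
// causal/autoregressive attention. Names are illustrative.
#include <memory>

struct llama_kv_cache { /* ... */ };

struct model_params_sketch {
    bool causal_attn; // false for embeddings-only models such as BERT
};

std::unique_ptr<llama_kv_cache> create_kv_cache(const model_params_sketch & params) {
    if (!params.causal_attn) {
        return nullptr; // embeddings-only: no KV cache needed
    }
    return std::make_unique<llama_kv_cache>();
}
```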
original #10544
target #12799

This is a rebase of the #10544 PR by @JohannesGaessler on top of the upcoming #12799. The purpose is only to highlight the necessary changes that...
### Prerequisites

- [x] I am running the latest code. Mention the version if possible as well.
- [x] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.md).
- [x] I searched using keywords...
TODO: write detailed description of the changes.

*(outdated)* This is still very WIP - the goal is to redesign the unified KV cache to properly support layers with sliding-window attention...
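To illustrate what per-layer sliding-window attention means for the mask, here is a small standalone sketch. It is not the actual llama.cpp graph code; `n_swa`, the mask layout, and the `-1e9` sentinel are illustrative assumptions:

```cpp
// Illustrative per-layer masking: some layers use full causal attention,
// others a sliding window of size n_swa. Not the real llama.cpp code.
#include <cstdint>
#include <vector>

std::vector<float> build_kq_mask(int32_t n_tokens, int32_t n_swa, bool swa_layer) {
    const float neg_inf = -1e9f; // "masked out"
    std::vector<float> mask(n_tokens * n_tokens, neg_inf);
    for (int32_t i = 0; i < n_tokens; ++i) {   // query position
        for (int32_t j = 0; j <= i; ++j) {     // causal: keys up to i
            if (!swa_layer || i - j < n_swa) {
                mask[i*n_tokens + j] = 0.0f;   // visible
            }
        }
    }
    return mask;
}
```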
# Overview

This is a list of changes to the public HTTP interface of the `llama-server` example. Collaborators are encouraged to edit this post in order to reflect important changes...
fix #16657
ref https://github.com/ggml-org/llama.cpp/pull/16276#pullrequestreview-3287676108

This fixes RPC inference when the Metal backend is involved.

Testing:

```bash
# server
make -j && ./bin/rpc-server

# cli
make -j && ./bin/llama-cli -m ../models/gemma-3-4b-it/ggml-model-f16.gguf...
```
target #16148
save `gg/fa-no-kq-pad-save`

Gauging what it would take to remove the KQ mask padding along the batch dimension (`ne31`). Removing this padding would simplify the graph building logic and...
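For context, a toy example of the padding in question: the batch dimension of the KQ mask is rounded up to an alignment granularity, which is what the change would remove. The constant and helper below are illustrative assumptions, not the actual ggml definitions:

```cpp
// Sketch of the padding being discussed: the mask's batch dimension is
// rounded up to a multiple of a padding constant. Values are assumptions.
#include <cstdint>
#include <cstdio>

constexpr int64_t KQ_MASK_PAD = 64; // assumed padding granularity

constexpr int64_t pad_to(int64_t x, int64_t n) {
    return ((x + n - 1) / n) * n;   // same rounding idea as ggml's GGML_PAD macro
}

int main() {
    const int64_t n_tokens = 100;                            // batch size
    const int64_t ne31     = pad_to(n_tokens, KQ_MASK_PAD);  // padded mask rows
    std::printf("n_tokens=%lld padded ne31=%lld\n", (long long) n_tokens, (long long) ne31);
    return 0;
}
```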
While looking into https://github.com/ggml-org/llama.cpp/issues/17033#issuecomment-3508789138 I found that the warmup in `llama-batched-bench` trashes the worst-case graph allocation when using the Metal backend, causing extra graph allocations later on. The reason is that the...
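A toy illustration of the effect described above, using entirely hypothetical names rather than the ggml scheduler API: if the warmup pass rebuilds the reservation from its own small graph instead of the worst case, the next full-size graph has to allocate again:

```cpp
// Toy illustration (hypothetical names, not the ggml scheduler API):
// a warmup that replaces the worst-case reservation with its own, smaller
// size forces a re-allocation when the full-size graph runs.
#include <cstdio>

struct sched_sketch {
    size_t reserved = 0;

    // returns true if running this graph required growing the allocation
    bool run_graph(size_t required) {
        if (required <= reserved) {
            return false;
        }
        reserved = required; // extra allocation happens here
        return true;
    }

    // models the problematic behaviour: the reservation is rebuilt from
    // the current (small) graph instead of the worst case
    void rebuild_reservation(size_t size) { reserved = size; }
};

int main() {
    sched_sketch sched;
    sched.run_graph(1024);          // initial worst-case reservation
    sched.rebuild_reservation(64);  // warmup "trashes" the reservation
    const bool extra = sched.run_graph(1024);
    std::printf("extra allocation after warmup: %s\n", extra ? "yes" : "no");
    return 0;
}
```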