Georgi Gerganov
### Overview

Attempting to make two separate classes for the 2 types of KV cache:

- `llama_kv_cache_unified : llama_kv_cache`
- `llama_kv_cache_recurrent : llama_kv_cache`

```mermaid
graph TD;
  llama_memory_i --> llama_kv_cache
  llama_kv_cache...
```
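A minimal C++ sketch of the hierarchy described above. Only the class names and the base/derived relationships come from the description; the `clear()` method and the comments are illustrative assumptions, not the actual llama.cpp interface:

```cpp
// Sketch of the proposed split (illustrative only; the virtual method
// shown here is an assumption, not the real llama.cpp interface).
struct llama_memory_i {
    virtual ~llama_memory_i() = default;
    virtual void clear() = 0; // assumed common memory operation
};

struct llama_kv_cache : llama_memory_i {
    // shared KV-cache state/operations would live here
};

// standard attention KV cache (one cell per token position)
struct llama_kv_cache_unified : llama_kv_cache {
    void clear() override { /* reset all cells */ }
};

// cache for recurrent models (per-sequence recurrent state)
struct llama_kv_cache_recurrent : llama_kv_cache {
    void clear() override { /* reset per-sequence recurrent state */ }
};
```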
target #12799

There is no need to create a KV cache when using embeddings-only models such as BERT.
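As a rough sketch of what "no KV cache for embeddings-only models" could look like: allocate the cache only when the model actually runs causal attention. The `causal_attn` flag and the `create_kv_cache` helper below are hypothetical names for illustration, not taken from the PR:

```cpp
// Hypothetical sketch: only allocate a KV cache when the model performs
// causal/autoregressive attention. Names are illustrative.
#include <memory>

struct llama_kv_cache { /* ... */ };

struct model_params_sketch {
    bool causal_attn; // false for embeddings-only models such as BERT
};

std::unique_ptr<llama_kv_cache> create_kv_cache(const model_params_sketch & params) {
    if (!params.causal_attn) {
        return nullptr; // embeddings-only: no KV cache needed
    }
    return std::make_unique<llama_kv_cache>();
}
```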
original #10544
target #12799

This is a rebase of the #10544 PR by @JohannesGaessler on top of the upcoming #12799. The purpose is only to highlight the necessary changes that...
### Prerequisites

- [x] I am running the latest code. Mention the version if possible as well.
- [x] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.md).
- [x] I searched using keywords...
TODO: write detailed description of the changes.

*(outdated)* This is still very WIP - the goal is to redesign the unified KV cache to properly support layers with sliding-window attention...
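To illustrate what per-layer sliding-window attention means for the mask, here is a small standalone sketch. It is not the actual llama.cpp graph code; `n_swa`, the mask layout, and the `-1e9` sentinel are illustrative assumptions:

```cpp
// Illustrative per-layer masking: some layers use full causal attention,
// others a sliding window of size n_swa. Not the real llama.cpp code.
#include <cstdint>
#include <vector>

std::vector<float> build_kq_mask(int32_t n_tokens, int32_t n_swa, bool swa_layer) {
    const float neg_inf = -1e9f; // "masked out"
    std::vector<float> mask(n_tokens * n_tokens, neg_inf);
    for (int32_t i = 0; i < n_tokens; ++i) {   // query position
        for (int32_t j = 0; j <= i; ++j) {     // causal: keys up to i
            if (!swa_layer || i - j < n_swa) {
                mask[i*n_tokens + j] = 0.0f;   // visible
            }
        }
    }
    return mask;
}
```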
# Overview

This is a list of changes to the public HTTP interface of the `llama-server` example. Collaborators are encouraged to edit this post in order to reflect important changes...
fix #16657
ref https://github.com/ggml-org/llama.cpp/pull/16276#pullrequestreview-3287676108

This fixes RPC inference when the Metal backend is involved.

Testing:

```bash
# server
make -j && ./bin/rpc-server

# cli
make -j && ./bin/llama-cli -m ../models/gemma-3-4b-it/ggml-model-f16.gguf...
```
target #16148
save `gg/fa-no-kq-pad-save`

Gauging what it would take to remove the KQ mask padding along the batch dimension (`ne31`). Removing this padding would simplify the graph building logic and...
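For context, a toy example of the padding in question: the batch dimension of the KQ mask is rounded up to an alignment granularity, which is what the change would remove. The constant and helper below are illustrative assumptions, not the actual ggml definitions:

```cpp
// Sketch of the padding being discussed: the mask's batch dimension is
// rounded up to a multiple of a padding constant. Values are assumptions.
#include <cstdint>
#include <cstdio>

constexpr int64_t KQ_MASK_PAD = 64; // assumed padding granularity

constexpr int64_t pad_to(int64_t x, int64_t n) {
    return ((x + n - 1) / n) * n;   // same rounding idea as ggml's GGML_PAD macro
}

int main() {
    const int64_t n_tokens = 100;                            // batch size
    const int64_t ne31     = pad_to(n_tokens, KQ_MASK_PAD);  // padded mask rows
    std::printf("n_tokens=%lld padded ne31=%lld\n", (long long) n_tokens, (long long) ne31);
    return 0;
}
```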
While looking into https://github.com/ggml-org/llama.cpp/issues/17033#issuecomment-3508789138 I found that the warmup in `llama-batched-bench` trashes the worst-case graph allocation when using the Metal backend, causing extra graph allocations later on. The reason is that the...
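A toy illustration of the effect described above, using entirely hypothetical names rather than the ggml scheduler API: if the warmup pass rebuilds the reservation from its own small graph instead of the worst case, the next full-size graph has to allocate again:

```cpp
// Toy illustration (hypothetical names, not the ggml scheduler API):
// a warmup that replaces the worst-case reservation with its own, smaller
// size forces a re-allocation when the full-size graph runs.
#include <cstdio>

struct sched_sketch {
    size_t reserved = 0;

    // returns true if running this graph required growing the allocation
    bool run_graph(size_t required) {
        if (required <= reserved) {
            return false;
        }
        reserved = required; // extra allocation happens here
        return true;
    }

    // models the problematic behaviour: the reservation is rebuilt from
    // the current (small) graph instead of the worst case
    void rebuild_reservation(size_t size) { reserved = size; }
};

int main() {
    sched_sketch sched;
    sched.run_graph(1024);          // initial worst-case reservation
    sched.rebuild_reservation(64);  // warmup "trashes" the reservation
    const bool extra = sched.run_graph(1024);
    std::printf("extra allocation after warmup: %s\n", extra ? "yes" : "no");
    return 0;
}
```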