ik_llama.cpp

llama.cpp fork with additional SOTA quants and improved performance

Results: 44 ik_llama.cpp issues, sorted by most recently updated

I only have a 2×GPU system, so I have no way to test the best graph-splitting strategy on a multi-GPU system. On the main branch I'm forcing a second graph split...

This change seems to result in slightly better TG performance with split mode "graph" and tensor overrides. Basically, for TG just remove the forced graph split when combining partial shared...
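For context, a minimal invocation sketch of the kind of setup being discussed, assuming the "graph" split mode is selected with `-sm graph` and tensor overrides with `-ot`; the model path, layer count, and override pattern below are placeholders, not values from the original reports:

```bash
# Illustrative 2-GPU launch (all values are placeholders):
#   -sm graph    -> select the "graph" split mode discussed above (assumed flag spelling)
#   -ot exps=CPU -> tensor override keeping MoE expert tensors in system RAM
./build/bin/llama-server \
  -m /models/model.gguf \
  -ngl 99 \
  -sm graph \
  -ot "exps=CPU"
```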

### Prerequisites
- [x] I am running the latest code. Mention the version if possible as well.
- [x] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.md).
- [x] I searched using keywords...

enhancement

First, thank you for maintaining this project — it has been very useful, and I appreciate the work that has gone into it. I initially created a fork to add...

### What happened?
When a client disconnects while llama-server is still processing the prompt (before any token is streamed), the server continues running the generation until completion. This wastes compute...
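A rough reproduction sketch of the reported behaviour, assuming the server is listening on its OpenAI-compatible chat endpoint; the port, model name, and timeout are placeholders, not from the report:

```bash
# Start a streaming chat request and abort the client after 2 seconds,
# i.e. while the server is (most likely) still processing the prompt.
curl -N --max-time 2 http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "placeholder",
        "stream": true,
        "messages": [{"role": "user", "content": "<very long prompt here>"}]
      }'
# Per the report, llama-server keeps generating to completion
# even though the client is already gone.
```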

### What happened?
The tool calls seem to be broken?

### Name and Version
llama-server --version
version: 3872 (f8d511a3) built with cc (Debian 14.3.0-5) 14.3.0 for x86_64-linux-gnu

### What operating...

### What happened?
There is a segfault with speculative decoding for a sufficiently large prompt (`pre /opt/GGUF-Tool-Suite/GGUF-Tool-Suite/quant_assign.py | mods -m g "explain the code"`).

```
/opt/ik_llama.cpp/ik_llama.cpp/build/bin/llama-server --model...
```

### What happened?
When trying to convert https://huggingface.co/moonshotai/Kimi-K2-Thinking to BF16 using this command:

```
python3 ~/pkgs/ik_llama.cpp/convert_hf_to_gguf.py --outtype bf16 \
  --outfile /mnt/Toshiba_Canvio_4TB_Top_Left/neuro/Kimi-K2-Thinking-BF16/Kimi-K2-Thinking-BF16.gguf \
  /mnt/Toshiba_Canvio_4TB_Top_Left/neuro/Kimi-K2-Thinking --split-max-size 50G
```

...it fails (please check...

wontfix

### What happened?
I was comparing the output of DeepSeek vs GLM-4.5 when I isolated a case where llama-server repeatedly fails when these parameters are passed: --attention-max-batch 2048 --batch-size 16384...

### What happened?
If I increase the context size and have to decrease `-ngl` so that part of the layers ends up in RAM, it crashes when it receives the first request from...
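For illustration only, a sketch of the two configurations being contrasted; the model path, context sizes, and layer counts are placeholders, not values from the report:

```bash
# Fully offloaded, smaller context: works.
./build/bin/llama-server -m /models/model.gguf -c 8192 -ngl 99

# Larger context forces a lower -ngl, leaving part of the layers in RAM;
# reported to crash on the first incoming request.
./build/bin/llama-server -m /models/model.gguf -c 65536 -ngl 40
```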