ik_llama.cpp

llama.cpp fork with additional SOTA quants and improved performance

Results: 44 ik_llama.cpp issues, sorted by most recently updated

I only have a 2×GPU system, so I have no way to test the best graph-splitting strategy on a multi-GPU system. On the main branch I'm forcing a second graph split...

This change seems to result in slightly better TG performance with split mode "graph" and tensor overrides. Basically, for TG just remove the forced graph split when combining partial shared...
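For context, a minimal invocation sketch of the kind of setup being discussed, assuming the "graph" split mode is selected with `-sm graph` and tensor overrides with `-ot`; the model path, layer count, and override pattern below are placeholders, not values from the original reports:

```bash
# Illustrative 2-GPU launch (all values are placeholders):
#   -sm graph    -> select the "graph" split mode discussed above (assumed flag spelling)
#   -ot exps=CPU -> tensor override keeping MoE expert tensors in system RAM
./build/bin/llama-server \
  -m /models/model.gguf \
  -ngl 99 \
  -sm graph \
  -ot "exps=CPU"
```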

### Prerequisites
- [x] I am running the latest code. Mention the version if possible as well.
- [x] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.md).
- [x] I searched using keywords...

enhancement

First, thank you for maintaining this project — it has been very useful, and I appreciate the work that has gone into it. I initially created a fork to add...

### What happened?
When a client disconnects while llama-server is still processing the prompt (before any token is streamed), the server continues running the generation until completion. This wastes compute...
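A rough reproduction sketch of the reported behaviour, assuming the server is listening on its OpenAI-compatible chat endpoint; the port, model name, and timeout are placeholders, not from the report:

```bash
# Start a streaming chat request and abort the client after 2 seconds,
# i.e. while the server is (most likely) still processing the prompt.
curl -N --max-time 2 http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "placeholder",
        "stream": true,
        "messages": [{"role": "user", "content": "<very long prompt here>"}]
      }'
# Per the report, llama-server keeps generating to completion
# even though the client is already gone.
```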

### What happened?
The tool calls seem to be broken?

### Name and Version
llama-server --version
version: 3872 (f8d511a3) built with cc (Debian 14.3.0-5) 14.3.0 for x86_64-linux-gnu

### What operating...

### What happened?
There is a segfault with speculative decoding for a sufficiently large prompt (`pre /opt/GGUF-Tool-Suite/GGUF-Tool-Suite/quant_assign.py | mods -m g "explain the code"`).

```
/opt/ik_llama.cpp/ik_llama.cpp/build/bin/llama-server --model...
```

### What happened?
When trying to convert https://huggingface.co/moonshotai/Kimi-K2-Thinking to BF16 using this command:

```
python3 ~/pkgs/ik_llama.cpp/convert_hf_to_gguf.py --outtype bf16 \
  --outfile /mnt/Toshiba_Canvio_4TB_Top_Left/neuro/Kimi-K2-Thinking-BF16/Kimi-K2-Thinking-BF16.gguf \
  /mnt/Toshiba_Canvio_4TB_Top_Left/neuro/Kimi-K2-Thinking --split-max-size 50G
```

...it fails (please check...

wontfix

### What happened?
I was comparing the output of DeepSeek vs GLM-4.5 when I isolated a case where llama-server repeatedly fails when these parameters are passed: --attention-max-batch 2048 --batch-size 16384...

### What happened?
If I increase the context size and have to decrease `-ngl` so that part of the layers ends up in RAM, it crashes when it receives the first request from...
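For illustration only, a sketch of the two configurations being contrasted; the model path, context sizes, and layer counts are placeholders, not values from the report:

```bash
# Fully offloaded, smaller context: works.
./build/bin/llama-server -m /models/model.gguf -c 8192 -ngl 99

# Larger context forces a lower -ngl, leaving part of the layers in RAM;
# reported to crash on the first incoming request.
./build/bin/llama-server -m /models/model.gguf -c 65536 -ngl 40
```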