# llama.cpp
LLM inference in C/C++
---

### Name and Version
build: 4761 (cad53fc9) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

### Operating systems
Linux

### Which llama.cpp modules do you know to be affected?
llama-server

### ...
---

### Name and Version
./llama-cli, latest version, Ubuntu Linux, RISC-V

### Operating systems
Linux

### Which llama.cpp modules do you know to be affected?
llama-cli

### Problem description & steps...
---

### Feature Description
The current llama.cpp implementation does not optimally utilize NUMA architecture when running Mixture-of-Experts (MoE) models, potentially leaving significant performance gains untapped.

### Proposed Solution
Implement NUMA-aware expert allocation through...
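A minimal sketch of what NUMA-aware expert placement could look like with libnuma (Linux, link with `-lnuma`). The helper name, expert count, and buffer size are all hypothetical — nothing here is existing ggml/llama.cpp code; it only illustrates pinning each expert's weights to the node whose cores will read them:

```cpp
// Hypothetical sketch of NUMA-aware expert placement using libnuma.
// None of these helpers exist in ggml/llama.cpp; they only illustrate
// the proposed idea: bind each expert's weight pages to the NUMA node
// whose threads will run that expert.
#include <numa.h>
#include <cstddef>
#include <cstdio>
#include <vector>

// Allocate one expert's weight buffer with pages bound to `node`.
static void * alloc_expert_on_node(size_t n_bytes, int node) {
    return numa_alloc_onnode(n_bytes, node);
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available, fall back to regular allocation\n");
        return 1;
    }
    const int    n_nodes      = numa_num_configured_nodes();
    const int    n_experts    = 8;                  // e.g. a small MoE layer
    const size_t expert_bytes = 64 * 1024 * 1024;   // placeholder weight size

    std::vector<void *> experts(n_experts);
    for (int e = 0; e < n_experts; ++e) {
        const int node = e % n_nodes;  // round-robin experts over NUMA nodes
        experts[e] = alloc_expert_on_node(expert_bytes, node);
        printf("expert %d -> NUMA node %d\n", e, node);
    }
    // The complementary half of the idea: pin the compute threads for
    // expert e to node e % n_nodes (e.g. via numa_run_on_node) so that
    // weight reads stay node-local.
    for (void * p : experts) numa_free(p, expert_bytes);
    return 0;
}
```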
---

Modeled after the CUDA implementations. Because of the use of `type4x4`, I had no idea how to reuse the existing `dequantize` functions, so those are repeated here in `float` form....
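For context, here is what scalar dequantization of one Q4_0 block looks like in plain C++ — a reference sketch of the documented ggml block layout, not the shader code this PR adds (and it simplifies by holding the scale as `float`, where the real struct stores fp16):

```cpp
#include <cstdint>

// Q4_0 layout in ggml: blocks of 32 weights, one fp16 scale `d` and
// 16 bytes of packed 4-bit quants; each nibble q decodes to (q - 8) * d.
// Sketch only: the scale is a float here, the real format stores fp16.
constexpr int QK4_0 = 32;

struct block_q4_0_sketch {
    float   d;              // scale (fp16 in the real format)
    uint8_t qs[QK4_0 / 2];  // 32 x 4-bit quants, two per byte
};

// Dequantize one block into 32 floats. Low nibbles fill the first half
// of the block, high nibbles the second half, as in ggml.
static void dequantize_q4_0(const block_q4_0_sketch * b, float * out) {
    for (int i = 0; i < QK4_0 / 2; ++i) {
        const int lo = (b->qs[i] & 0x0F) - 8;  // low nibble
        const int hi = (b->qs[i] >>   4) - 8;  // high nibble
        out[i]             = lo * b->d;
        out[i + QK4_0 / 2] = hi * b->d;
    }
}
```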
---

This basically makes the mul_mm shaders load and dequantize 4 or 8 values at a time, the way it's already done in mat_vec (old quants only). Results on my RX 470:...
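The access pattern in question, roughly: decode eight packed 4-bit quants from a single 32-bit load per iteration instead of one byte at a time. A hedged plain-C++ stand-in for what the shaders do — `dequant8_q4_0` is a made-up name, and the real Q4_0 ordering splits low/high nibbles across the block:

```cpp
#include <cstdint>
#include <cstring>

// Sketch of the batched access pattern: pull 8 packed 4-bit quants
// (one 32-bit load) per iteration instead of one byte at a time,
// mirroring what the mat_vec path already does. Plain-C++ stand-in
// for the shader code; `d` is the block scale.
// Note: real Q4_0 splits low/high nibbles across the block; this
// sketch ignores that ordering detail to keep the pattern visible.
static void dequant8_q4_0(const uint8_t * qs, float d, float * out) {
    uint32_t word;
    memcpy(&word, qs, sizeof(word));  // one load, 8 nibbles
    for (int i = 0; i < 8; ++i) {
        const int q = int((word >> (4 * i)) & 0xF) - 8;
        out[i] = q * d;               // 8 results per loaded word
    }
}
```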
---

### Name and Version
```
[root@localhost ~]# ~/llama.cpp/build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size:...
```
---

This PR adds the following optimizations to the CUDA FlashAttention code:

* For models with grouped-query attention, re-use the loaded K/V data across multiple attention heads. This also has the...
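The grouped-query-attention point can be illustrated outside the kernel: with `n_head` query heads sharing `n_head_kv` K/V heads, each K/V head is loaded once and reused by every query head in its group, rather than re-read per query head. A sketch of that mapping in plain C++ — illustrative only, not the PR's CUDA code:

```cpp
#include <cstdio>

// Grouped-query attention head mapping: n_head query heads share
// n_head_kv K/V heads. The optimization loads K/V for a KV head once
// (in the kernel: into shared memory/registers) and reuses it for
// every query head in its group. Head counts are placeholders.
int main() {
    const int n_head    = 32;                  // query heads
    const int n_head_kv = 8;                   // shared K/V heads
    const int gqa_ratio = n_head / n_head_kv;  // query heads per K/V head

    for (int kvh = 0; kvh < n_head_kv; ++kvh) {
        printf("load K/V head %d once\n", kvh);
        for (int g = 0; g < gqa_ratio; ++g) {
            const int qh = kvh * gqa_ratio + g;
            // attention for query head qh reuses the already-loaded K/V
            printf("  attend query head %d using K/V head %d\n", qh, kvh);
        }
    }
    return 0;
}
```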
---

### Name and Version
```
❯ llama-cli --version
version: 4568 (a4417ddd) built with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.2.0
```

### Operating systems
Mac Studio M2 Ultra (192GB)

### Which llama.cpp...
---

This PR enables CI on GitHub-hosted arm64 runners, which are now [available for free](https://github.blog/changelog/2025-01-16-linux-arm64-hosted-runners-now-available-for-free-in-public-repositories-public-preview/) in public repositories. Related to #11275.
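For reference, a minimal job targeting one of these runners might look as follows — the `ubuntu-22.04-arm` label comes from the linked changelog, but the workflow itself is a hypothetical example, not this PR's actual CI configuration:

```yaml
# Hypothetical minimal workflow on a free GitHub-hosted arm64 runner;
# not the PR's actual CI configuration.
name: arm64-ci-example
on: [push]
jobs:
  build-arm64:
    runs-on: ubuntu-22.04-arm   # arm64 runner label per the linked changelog
    steps:
      - uses: actions/checkout@v4
      - name: Build
        run: |
          cmake -B build
          cmake --build build --config Release -j $(nproc)
```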
---

Allows specifying a JSON schema by file (currently the only flag is `-j` / `--json-schema`, which takes the full schema itself as an argument).
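As a usage illustration (with placeholder `model.gguf` / `schema.json` names): today the whole schema must be passed inline, so reading it from a file takes a shell substitution — which is the convenience this PR turns into a proper flag (whose name isn't shown in this excerpt):

```sh
# Current behaviour: -j / --json-schema takes the full schema text inline.
llama-cli -m model.gguf -p 'Produce a user record' \
    -j '{"type":"object","properties":{"name":{"type":"string"}},"required":["name"]}'

# Workaround today, and what this PR makes first-class:
llama-cli -m model.gguf -p 'Produce a user record' -j "$(cat schema.json)"
```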