# llama.cpp
LLM inference in C/C++
---

### Name and Version
build: 4761 (cad53fc9) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

### Operating systems
Linux

### Which llama.cpp modules do you know to be affected?
llama-server

### ...
---

### Name and Version
./llama-cli, latest version, Ubuntu Linux, RISC-V

### Operating systems
Linux

### Which llama.cpp modules do you know to be affected?
llama-cli

### Problem description & steps...
---

### Feature Description
The current llama.cpp implementation does not optimally utilize NUMA architecture when running Mixture-of-Experts (MoE) models, potentially leaving significant performance gains untapped.

### Proposed Solution
Implement NUMA-aware expert allocation through...
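A minimal sketch of what NUMA-aware expert placement could look like with libnuma (Linux, link with `-lnuma`). The helper name, expert count, and buffer size are all hypothetical — nothing here is existing ggml/llama.cpp code; it only illustrates pinning each expert's weights to the node whose cores will read them:

```cpp
// Hypothetical sketch of NUMA-aware expert placement using libnuma.
// None of these helpers exist in ggml/llama.cpp; they only illustrate
// the proposed idea: bind each expert's weight pages to the NUMA node
// whose threads will run that expert.
#include <numa.h>
#include <cstddef>
#include <cstdio>
#include <vector>

// Allocate one expert's weight buffer with pages bound to `node`.
static void * alloc_expert_on_node(size_t n_bytes, int node) {
    return numa_alloc_onnode(n_bytes, node);
}

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available, fall back to regular allocation\n");
        return 1;
    }
    const int    n_nodes      = numa_num_configured_nodes();
    const int    n_experts    = 8;                  // e.g. a small MoE layer
    const size_t expert_bytes = 64 * 1024 * 1024;   // placeholder weight size

    std::vector<void *> experts(n_experts);
    for (int e = 0; e < n_experts; ++e) {
        const int node = e % n_nodes;  // round-robin experts over NUMA nodes
        experts[e] = alloc_expert_on_node(expert_bytes, node);
        printf("expert %d -> NUMA node %d\n", e, node);
    }
    // The complementary half of the idea: pin the compute threads for
    // expert e to node e % n_nodes (e.g. via numa_run_on_node) so that
    // weight reads stay node-local.
    for (void * p : experts) numa_free(p, expert_bytes);
    return 0;
}
```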
---

Modeled after the CUDA implementations. Because of the use of `type4x4`, I had no idea how to reuse the existing `dequantize` functions, so those are repeated here in `float` form....
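For context, here is what scalar dequantization of one Q4_0 block looks like in plain C++ — a reference sketch of the documented ggml block layout, not the shader code this PR adds (and it simplifies by holding the scale as `float`, where the real struct stores fp16):

```cpp
#include <cstdint>

// Q4_0 layout in ggml: blocks of 32 weights, one fp16 scale `d` and
// 16 bytes of packed 4-bit quants; each nibble q decodes to (q - 8) * d.
// Sketch only: the scale is a float here, the real format stores fp16.
constexpr int QK4_0 = 32;

struct block_q4_0_sketch {
    float   d;              // scale (fp16 in the real format)
    uint8_t qs[QK4_0 / 2];  // 32 x 4-bit quants, two per byte
};

// Dequantize one block into 32 floats. Low nibbles fill the first half
// of the block, high nibbles the second half, as in ggml.
static void dequantize_q4_0(const block_q4_0_sketch * b, float * out) {
    for (int i = 0; i < QK4_0 / 2; ++i) {
        const int lo = (b->qs[i] & 0x0F) - 8;  // low nibble
        const int hi = (b->qs[i] >>   4) - 8;  // high nibble
        out[i]             = lo * b->d;
        out[i + QK4_0 / 2] = hi * b->d;
    }
}
```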
---

This basically makes the mul_mm shaders load and dequantize 4 or 8 values at a time, the way it's already done in mat_vec (old quants only). Results on my RX 470:...
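The access pattern in question, roughly: decode eight packed 4-bit quants from a single 32-bit load per iteration instead of one byte at a time. A hedged plain-C++ stand-in for what the shaders do — `dequant8_q4_0` is a made-up name, and the real Q4_0 ordering splits low/high nibbles across the block:

```cpp
#include <cstdint>
#include <cstring>

// Sketch of the batched access pattern: pull 8 packed 4-bit quants
// (one 32-bit load) per iteration instead of one byte at a time,
// mirroring what the mat_vec path already does. Plain-C++ stand-in
// for the shader code; `d` is the block scale.
// Note: real Q4_0 splits low/high nibbles across the block; this
// sketch ignores that ordering detail to keep the pattern visible.
static void dequant8_q4_0(const uint8_t * qs, float d, float * out) {
    uint32_t word;
    memcpy(&word, qs, sizeof(word));  // one load, 8 nibbles
    for (int i = 0; i < 8; ++i) {
        const int q = int((word >> (4 * i)) & 0xF) - 8;
        out[i] = q * d;               // 8 results per loaded word
    }
}
```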
---

### Name and Version
```
[root@localhost ~]# ~/llama.cpp/build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size:...
```
---

This PR adds the following optimizations to the CUDA FlashAttention code:

* For models with grouped-query attention, re-use the loaded K/V data across multiple attention heads. This also has the...
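The grouped-query-attention point can be illustrated outside the kernel: with `n_head` query heads sharing `n_head_kv` K/V heads, each K/V head is loaded once and reused by every query head in its group, rather than re-read per query head. A sketch of that mapping in plain C++ — illustrative only, not the PR's CUDA code:

```cpp
#include <cstdio>

// Grouped-query attention head mapping: n_head query heads share
// n_head_kv K/V heads. The optimization loads K/V for a KV head once
// (in the kernel: into shared memory/registers) and reuses it for
// every query head in its group. Head counts are placeholders.
int main() {
    const int n_head    = 32;                  // query heads
    const int n_head_kv = 8;                   // shared K/V heads
    const int gqa_ratio = n_head / n_head_kv;  // query heads per K/V head

    for (int kvh = 0; kvh < n_head_kv; ++kvh) {
        printf("load K/V head %d once\n", kvh);
        for (int g = 0; g < gqa_ratio; ++g) {
            const int qh = kvh * gqa_ratio + g;
            // attention for query head qh reuses the already-loaded K/V
            printf("  attend query head %d using K/V head %d\n", qh, kvh);
        }
    }
    return 0;
}
```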
---

### Name and Version
```
❯ llama-cli --version
version: 4568 (a4417ddd) built with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.2.0
```

### Operating systems
Mac Studio M2 Ultra (192GB)

### Which llama.cpp...
---

This PR enables CI on GitHub-hosted arm64 runners, which are now [available for free](https://github.blog/changelog/2025-01-16-linux-arm64-hosted-runners-now-available-for-free-in-public-repositories-public-preview/) in public repositories. Related to #11275.
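For reference, a minimal job targeting one of these runners might look as follows — the `ubuntu-22.04-arm` label comes from the linked changelog, but the workflow itself is a hypothetical example, not this PR's actual CI configuration:

```yaml
# Hypothetical minimal workflow on a free GitHub-hosted arm64 runner;
# not the PR's actual CI configuration.
name: arm64-ci-example
on: [push]
jobs:
  build-arm64:
    runs-on: ubuntu-22.04-arm   # arm64 runner label per the linked changelog
    steps:
      - uses: actions/checkout@v4
      - name: Build
        run: |
          cmake -B build
          cmake --build build --config Release -j $(nproc)
```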
---

Allows specifying a JSON schema by file (currently the only flag is `-j` / `--json-schema`, which takes the full schema itself as an argument).
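As a usage illustration (with placeholder `model.gguf` / `schema.json` names): today the whole schema must be passed inline, so reading it from a file takes a shell substitution — which is the convenience this PR turns into a proper flag (whose name isn't shown in this excerpt):

```sh
# Current behaviour: -j / --json-schema takes the full schema text inline.
llama-cli -m model.gguf -p 'Produce a user record' \
    -j '{"type":"object","properties":{"name":{"type":"string"}},"required":["name"]}'

# Workaround today, and what this PR makes first-class:
llama-cli -m model.gguf -p 'Produce a user record' -j "$(cat schema.json)"
```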