llama.cpp
LLM inference in C/C++
Hi there! I ran into this issue when trying to use higher values of `-b` and `-ub` with DeepSeek V3, since doing so increases prompt processing (PP) performance...
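As context for readers unfamiliar with these flags: `-b`/`--batch-size` sets the logical batch size and `-ub`/`--ubatch-size` the physical (micro) batch size in llama.cpp. A hedged sketch of the kind of invocation being described — the model path and the specific values are placeholders, not the reporter's exact setup:

```shell
# Placeholder model path and batch sizes; raising -b/-ub can improve
# prompt-processing throughput at the cost of more memory.
./llama-server -m ./models/deepseek-v3.gguf -b 4096 -ub 1024
```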
### Name and Version

Custom build of the llama.cpp library from b5022 (older versions crash as well)

### Operating systems

Windows

### GGML backends

CUDA

### Hardware

RTX 3080 Ti, i7-12700F

### Models

...
### Git commit

1682e39aa5bb1699fae3f760450be2e76d35a6a1

### Operating systems

Linux

### GGML backends

CUDA

### Problem description & steps to reproduce

Tell CMake where to find the compiler by setting either the...
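The truncated error above is CMake's standard message when it cannot locate a compiler. One common fix is to pass the compiler path explicitly when configuring; a sketch, assuming a typical CUDA install under `/usr/local/cuda` (the nvcc path is an assumption, not taken from the report):

```shell
# Assumed nvcc location; adjust to your CUDA toolkit install.
cmake -B build -DGGML_CUDA=ON \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
cmake --build build --config Release
```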
I run into this issue on nearly every `-bf` file when using `llama-perplexity` with `--multiple-choice`. Any idea what happened, or what should I do to fix this?...
Hi, I'm currently facing this `tokenizer_name NotImplementedError` while testing a quantized `.gguf` model with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). I'm having this trouble with `--apply_chat_template`. Run command: `lm_eval --model gguf --model_args base_url=http://127.0.1.1:8080 --tasks gsm8k --output_path result/gsm8k`...
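For readers reproducing this, a cleaned-up sketch of the invocation pattern described above: a llama.cpp server exposing an OpenAI-style endpoint, queried by lm-evaluation-harness's `gguf` model backend. The URL and output path are placeholders mirroring the report, not verified values:

```shell
# Start a llama.cpp server first (placeholder model path), e.g.:
#   ./llama-server -m ./models/model-q4_k_m.gguf --port 8080
# Then point lm-eval at it; --apply_chat_template is the flag
# the reporter says triggers the error.
lm_eval --model gguf \
  --model_args base_url=http://127.0.1.1:8080 \
  --tasks gsm8k \
  --apply_chat_template \
  --output_path result/gsm8k
```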
### Name and Version

Compiled at commit https://github.com/ggml-org/llama.cpp/commit/6562e5a4d6c58326dcd79002ea396d4141f1b18e, but it also happens on the latest master version.

### Operating systems

Mac

### Which llama.cpp modules do you know to be...
Added two new configuration presets to simplify command-line usage:

1. `--chat-llama3-8b-default` for running a chat server with the Llama 3 8B model
2. `--rerank-bge-default` for running a reranking server with the BGE...
In this PR:

- Remove `libllava` - it contains too much redundant and unsafe code, and `libmtmd` already covers all of its use cases with a better API
- Remove `clip-quantize-cli`...
This gives a 1.5× generation speed-up for Qwen VL models (tested on a MacBook M3 Max).

`master` branch:

| model | size | params | backend | threads | test | t/s...