Just a heads up, given that it's been more than a week since the last release: I'm deep in a complete overhaul of a series of behaviors and functions. The core focus is...
A large patch was just integrated into llama.cpp (https://github.com/ggerganov/llama.cpp/pull/2001), another stunning job by @ikawrakow. In the long run we need it; K-quants are better for 7B and have more...
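For anyone unfamiliar with the idea, here is a toy sketch of the block-wise scheme that K-quants build on: weights are grouped into fixed-size blocks, and each block stores low-bit signed integers plus a per-block scale. The block size, bit width, and function names below are purely illustrative, not the actual Q3_K layout.

```python
import numpy as np

def quantize_blocks(w: np.ndarray, block: int = 32, bits: int = 3):
    """Quantize a flat weight array into (block,)-sized groups of signed ints."""
    w = w.reshape(-1, block)
    qmax = 2 ** (bits - 1) - 1                      # e.g. 3 for signed 3-bit
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)     # avoid division by zero
    q = np.clip(np.round(w / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize_blocks(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q * scales).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_blocks(w)
err = np.sqrt(np.mean((w - dequantize_blocks(q, s)) ** 2))
print(f"RMS quantization error: {err:.4f}")
```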
Opening this as a ticket, as this is quite a large thing to solve. We still suffer a significant slowdown compared to the fast speeds seen for the first 1-2k tokens of context...
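A back-of-envelope sketch of why speed degrades past the first couple of thousand tokens: each new token's attention has to read the entire KV cache, so the per-token attention cost grows linearly with context length. The model dimensions below are purely illustrative.

```python
def attn_flops_per_token(n_ctx: int, n_layers: int = 60,
                         n_heads: int = 64, head_dim: int = 128) -> float:
    # QK^T and attention*V each cost ~2 * n_ctx * head_dim multiply-adds
    # per head per layer, so roughly 4 * n_ctx * head_dim in total.
    return 4.0 * n_ctx * head_dim * n_heads * n_layers

for n_ctx in (512, 2048, 8192):
    print(f"{n_ctx:5d} ctx -> {attn_flops_per_token(n_ctx) / 1e9:6.1f} GFLOPs/token in attention")
```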
I'm currently working on the tokenizer; we need a new one. The llama tokenizer is not suitable: it has problems forming larger tokens, favors smaller ones, and it does...
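To make the complaint concrete, here is a toy sketch of the behavior we'd want instead: a greedy longest-prefix-match tokenizer always forms the largest token the vocabulary allows. The vocabulary and function are hypothetical, purely to illustrate the idea; this is not llama.cpp's tokenizer.

```python
# Hypothetical toy vocabulary for illustration.
VOCAB = {"unbelievable", "un", "believ", "believe", "able",
         "a", "b", "e", "i", "l", "n", "u", "v"}

def longest_match_tokenize(text: str) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, so larger tokens always win.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

print(longest_match_tokenize("unbelievable"))  # ['unbelievable'], one large token
```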
With each token processed, the inference speed slows down a little. It starts to become noticeable at around 50 tokens on 40B Q3_K and adds up.
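A minimal harness for quantifying the drift, assuming some `generate_one_token` callable as a stand-in for whatever decode step you're testing: record the wall-clock latency of each step and compare the first and last 50.

```python
import time

def measure_slowdown(generate_one_token, n_tokens: int = 500) -> None:
    """Time each decode step and compare early vs. late per-token latency."""
    latencies = []
    for _ in range(n_tokens):
        t0 = time.perf_counter()
        generate_one_token()                      # stand-in for one decode step
        latencies.append(time.perf_counter() - t0)
    head = sum(latencies[:50]) / 50
    tail = sum(latencies[-50:]) / 50
    print(f"first 50: {head * 1e3:.2f} ms/token, last 50: {tail * 1e3:.2f} ms/token "
          f"({tail / head:.2f}x slower)")
```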
### What happened?

Moondream2 is a superb vision model; however, on llama.cpp it performs at a quality below vanilla llava-1. @vikhyat, maybe you'd like to take a look? I compared...
### What happened?

This has already been discussed a bit here: https://github.com/ggerganov/llama.cpp/issues/7938

` `
```
32001 -> ''
259 -> ' '
```

Also `\n`:
```
32001 -> ''
29871 -> ...
```
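One way to reproduce dumps like the above is via the llama-cpp-python bindings, assuming they are installed (the model path here is hypothetical):

```python
from llama_cpp import Llama

# vocab_only loads just the tokenizer, not the weights
llm = Llama(model_path="model.gguf", vocab_only=True)

for text in (" ", "\n"):
    ids = llm.tokenize(text.encode("utf-8"), add_bos=False)
    print(repr(text), "->", ids)
```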
I believe it's 13 different samplers now, and we keep getting more added. I am very sure that the vast majority of users, if not almost everyone, does not understand the differences...
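For anyone trying to build intuition, here is a sketch of three of the most common samplers chained together (top-k, then temperature, then top-p). The order and default values are illustrative, not llama.cpp's exact pipeline.

```python
import numpy as np

def sample_token(logits: np.ndarray, top_k: int = 40, top_p: float = 0.95,
                 temperature: float = 0.8, rng=None) -> int:
    if rng is None:
        rng = np.random.default_rng()
    # top-k: drop everything below the k-th highest logit
    if 0 < top_k < len(logits):
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)
    # temperature scaling, then softmax
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    # top-p (nucleus): keep the smallest set of tokens whose mass reaches top_p
    order = np.argsort(probs)[::-1]
    keep = order[: int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1]
    final = np.zeros_like(probs)
    final[keep] = probs[keep]
    return int(rng.choice(len(probs), p=final / final.sum()))

logits = np.random.randn(32000)   # toy vocabulary-sized logits
print(sample_token(logits))
```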
### Name and Version

all versions 2025

### Operating systems

Linux

### GGML backends

CUDA, Metal

### Hardware

any

### Models

_No response_

### Problem description & steps to reproduce

...
### Name and Version

```
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce...
```