renbuarl
Very similar to #5913, but for the case of multiple GPUs; in #5913 it is indeed a workaround, since VRAM is genuinely low. In this case, reducing num_gpu is...
[journal.txt](https://github.com/user-attachments/files/16377575/journal.txt)
Speedway1, thank you for your message! However, it seems the issue is not with ollama but with llama.cpp. I built the latest release of llama.cpp (b3488) following the methodology...
https://github.com/ggerganov/llama.cpp/issues/8766
> Hi @renbuarl, I think the problem there is your massive context length. Great advice to use the `--flash-attn` option. `~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 65536 --host '192.168.0.5'...`
> The bug here is likely that we're not properly adjusting the prediction for the large context size. What do we have? When launching without the `--flash-attn` option for llama-server: `~/llama.cpp/llama-server...`
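For reference, the two launches being compared in the quoted comment can be sketched as follows. The model path, context size, and host are taken from the command above; treat them as placeholders for your own setup, and note that the trailing flags of the original commands were truncated:

```shell
# Baseline: llama-server with a 64K context, no flash attention.
# At this context size the KV cache dominates the VRAM estimate.
~/llama.cpp/llama-server -v \
  -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf \
  -c 65536 \
  --host '192.168.0.5'

# Same launch with flash attention enabled, as suggested above;
# this reduces attention memory overhead at large context sizes.
~/llama.cpp/llama-server -v \
  -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf \
  -c 65536 \
  --flash-attn \
  --host '192.168.0.5'
```

Comparing the verbose (`-v`) startup logs of these two runs is what surfaces the difference in the predicted memory usage discussed here.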