
Results: 6 comments by renbuarl

Very similar to #5913, but for the case of multiple GPUs; in #5913 it is indeed a workaround, as VRAM is genuinely low. In this case, reducing num_gpu is...

[journal.txt](https://github.com/user-attachments/files/16377575/journal.txt)

Speedway1, thank you for your message! However, it seems that the issue is not with ollama but with llama.cpp. I built the latest llama.cpp release (b3488) following the methodology...

> Hi @renbuarl, I think that the problem there is your massive context length.

Great advice to use the `--flash-attn` option. `~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 65536 --host '192.168.0.5'`...
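
To see why a 65536-token context is "massive" here, the KV cache alone can be estimated from the model shape. A minimal sketch, assuming Qwen2-72B's published configuration (80 layers, GQA with 8 KV heads of dimension 128) and llama.cpp's default f16 cache; the exact figure on a real run may differ slightly:

```python
# Back-of-the-envelope KV-cache size for the -c 65536 command above.
# Assumptions (not from the thread): Qwen2-72B has 80 layers with
# 8 KV heads x head_dim 128 (GQA), and the K/V cache is stored in
# f16 (2 bytes per element), llama.cpp's default.

def kv_cache_bytes(n_ctx, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elt=2):
    """Bytes needed for the K and V caches at the given context length."""
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elt  # K + V
    return n_ctx * n_layers * per_token_per_layer

gib = kv_cache_bytes(65536) / 2**30
print(f"{gib:.0f} GiB")  # → 20 GiB, on top of the ~44 GB of Q4_K_M weights
```

Roughly 20 GiB of cache before any compute buffers, which is why spreading the load across multiple GPUs still runs tight on VRAM.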

> The bug here is likely we're not properly adjusting the prediction for the large context size.

What do we have? When launching llama-server without the `--flash-attn` option: `~/llama.cpp/llama-server`...
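
The gap between the runs with and without `--flash-attn` comes largely from the attention-score scratch buffer: without flash attention the KQ matrix is materialized in f32. A rough sketch of its size under assumptions not stated in the thread (64 attention heads for Qwen2-72B, a 512-token micro-batch, f32 scores); llama.cpp's actual buffer layout differs in the details:

```python
# Rough size of the f32 attention-score (KQ) buffer materialized when
# flash attention is disabled. Assumptions (hypothetical parameters, not
# from the thread): 64 attention heads, 512-token micro-batch, 4-byte f32.

def kq_buffer_bytes(n_ctx, n_ubatch=512, n_heads=64, bytes_per_elt=4):
    """One f32 score per (head, batch token, cached token) triple."""
    return n_heads * n_ubatch * n_ctx * bytes_per_elt

gib = kq_buffer_bytes(65536) / 2**30
print(f"{gib:.0f} GiB")  # → 8 GiB of scratch that flash attention avoids
```

Flash attention computes the same result in tiles without ever holding the full score matrix, so the scratch cost stops scaling with the context length.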