renbuarl
Very similar to #5913, but for the case of multiple GPUs; in #5913 it is indeed a workaround, since VRAM is genuinely low. In this case, reducing num_gpu is...
[journal.txt](https://github.com/user-attachments/files/16377575/journal.txt)
Speedway1, thank you for your message! However, it seems the issue is not with ollama but with llama.cpp. I built the latest release of llama.cpp (b3488) following the methodology...
https://github.com/ggerganov/llama.cpp/issues/8766
> Hi @renbuarl, I think the problem there is your massive context length. Great advice to use the `--flash-attn` option. `~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 65536 --host '192.168.0.5'...`
> The bug here is likely that we're not properly adjusting the prediction for the large context size. What do we have? When launching without the `--flash-attn` option for llama-server: `~/llama.cpp/llama-server...`
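For reference, the two launches being compared in the quoted comment can be sketched as follows. The model path, context size, and host are taken from the command above; treat them as placeholders for your own setup, and note that the trailing flags of the original commands were truncated:

```shell
# Baseline: llama-server with a 64K context, no flash attention.
# At this context size the KV cache dominates the VRAM estimate.
~/llama.cpp/llama-server -v \
  -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf \
  -c 65536 \
  --host '192.168.0.5'

# Same launch with flash attention enabled, as suggested above;
# this reduces attention memory overhead at large context sizes.
~/llama.cpp/llama-server -v \
  -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf \
  -c 65536 \
  --flash-attn \
  --host '192.168.0.5'
```

Comparing the verbose (`-v`) startup logs of these two runs is what surfaces the difference in the predicted memory usage discussed here.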