Memory leak? Windows, llama-ipex
Using the prebuilt llama.cpp Portable Zip on Windows 11 with an Intel Arc A770 GPU (driver version 32.0.101.6734), there appears to be a memory leak. When running the model DeepSeek-R1-Distill-Qwen-14B-Q4_0, the llama-server.exe process gradually consumes increasing amounts of RAM over time as the model continues generating output. This eventually leads to system instability, including Explorer.exe becoming unresponsive or freezing.
How to reproduce
Steps to reproduce the error:
- Download and extract the prebuilt `llama.cpp` Portable Zip.
- Launch `llama-server.exe` using the following command: `llama-server.exe -m "%dir_path%\!file[%selection%]!" -c 9000 -ngl 99 --port 8000`
- Load the model `DeepSeek-R1-Distill-Qwen-14B-Q4_0`.
- Make a generation request to the model.
- Monitor RAM usage of `llama-server.exe`; it will keep increasing gradually (see the polling sketch after this list).
- Eventually, system performance degrades and `Explorer.exe` may freeze.
- Lowering the `-c` parameter (e.g. `-c 2048`, `-c 4096`) does not affect the behavior; memory usage still grows continuously.
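For the monitoring step, a minimal sketch of a polling loop (save it as a .bat file and run it in a second cmd window while the generation request is in progress; it only assumes the process name `llama-server.exe` and the built-in `tasklist`/`timeout` tools):

```bat
@echo off
rem Print the "Mem Usage" (working set) of llama-server.exe every 5 seconds
rem so the gradual RAM growth during generation is easy to see.
:loop
tasklist /FI "IMAGENAME eq llama-server.exe"
timeout /t 5 /nobreak >nul
goto loop
```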
Environment information
- Windows 11
- Intel Arc A770 GPU
- GPU Driver: 32.0.101.6734
- Model: `DeepSeek-R1-Distill-Qwen-14B-Q4_0`
- llama.cpp version: IPEX-LLM release 2.2.0 portable llama.cpp
Have you tried adjusting the `-b` prompt processing batch size? I believe IPEX-LLM llama.cpp defaults it to 4096, which is rather memory intensive. This allows for faster prompt processing but takes up a lot more memory. Try setting it to `-b 1024`, which is what I usually do. Doing this with Llama 3.1 8B and a prompt of around 10K tokens lowers the memory usage from ~32 GB to a much more manageable ~10 GB.
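For concreteness, with the launch command from the report above, this just means appending the flag (a sketch; `-b` is llama.cpp's batch-size option):

```bat
llama-server.exe -m "%dir_path%\!file[%selection%]!" -c 9000 -ngl 99 --port 8000 -b 1024
```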
@Sketchfellow I tried it, but the llama-server process still gradually eats up memory. The GPU constantly offloads something into system memory as tokens are generated, and since reasoning models generate tons of tokens, the memory fills up very quickly.
> Have you tried adjusting the `-b` prompt processing batch size? I believe IPEX-LLM llama.cpp defaults it to 4096, which is rather memory intensive. This allows for faster prompt processing but takes up a lot more memory. Try setting it to `-b 1024`, which is what I usually do. Doing this with Llama 3.1 8B and a prompt of around 10K tokens lowers the memory usage from ~32 GB to a much more manageable ~10 GB.
That's it. I also found that a batch size over 2048 eats up all the memory (64 GB in total) and leads to a crash within seconds. I believe this is an IPEX-only bug, since the original llama-server works just fine.
Edit: this happens when the number of prompt tokens to process actually reaches the batch size, e.g. when you open an old chat.
Ollama just pre-released v0.6.6, which includes a fix for a memory leak: https://github.com/ollama/ollama/issues/10040