Memory leak? Windows, llama-ipex
Using the prebuilt llama.cpp Portable Zip on Windows 11 with an Intel Arc A770 GPU (driver version 32.0.101.6734), there appears to be a memory leak. When running the model DeepSeek-R1-Distill-Qwen-14B-Q4_0, the llama-server.exe process gradually consumes increasing amounts of RAM over time as the model continues generating output. This eventually leads to system instability, including Explorer.exe becoming unresponsive or freezing.
How to reproduce
Steps to reproduce the error:
- Download and extract the prebuilt `llama.cpp` Portable Zip.
- Launch `llama-server.exe` using the following command: `llama-server.exe -m "%dir_path%\!file[%selection%]!" -c 9000 -ngl 99 --port 8000`
- Load the model `DeepSeek-R1-Distill-Qwen-14B-Q4_0`.
- Make a generation request to the model.
- Monitor RAM usage of `llama-server.exe`; it will keep increasing gradually (see the polling sketch after this list).
- Eventually, system performance degrades and `Explorer.exe` may freeze.
- Lowering the `-c` parameter (e.g. `-c 2048`, `-c 4096`) does not affect the behavior; memory usage still grows continuously.
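For the monitoring step, a minimal sketch of a polling loop (save it as a .bat file and run it in a second cmd window while the generation request is in progress; it only assumes the process name `llama-server.exe` and the built-in `tasklist`/`timeout` tools):

```bat
@echo off
rem Print the "Mem Usage" (working set) of llama-server.exe every 5 seconds
rem so the gradual RAM growth during generation is easy to see.
:loop
tasklist /FI "IMAGENAME eq llama-server.exe"
timeout /t 5 /nobreak >nul
goto loop
```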
Environment information
- Windows 11
- Intel Arc A770 GPU
- GPU Driver: 32.0.101.6734
- Model: `DeepSeek-R1-Distill-Qwen-14B-Q4_0`
- llama.cpp version: IPEX-LLM release 2.2.0 portable llama.cpp
Have you tried adjusting the `-b` prompt processing batch size? I believe IPEX-LLM llama.cpp defaults it to 4096, which is rather memory intensive. This allows for faster prompt processing but takes up a lot more memory. Try setting it to `-b 1024`, which is what I usually do. Doing this with Llama 3.1 8B and a prompt of around 10K tokens lowers the memory usage from ~32 GB to a much more manageable ~10 GB.
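For concreteness, with the launch command from the report above, this just means appending the flag (a sketch; `-b` is llama.cpp's batch-size option):

```bat
llama-server.exe -m "%dir_path%\!file[%selection%]!" -c 9000 -ngl 99 --port 8000 -b 1024
```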
@Sketchfellow I tried it, but the llama-server process still gradually eats up memory. The GPU constantly offloads something into system memory as tokens are generated, and since reasoning models generate tons of tokens, the memory fills up very quickly.
> Have you tried adjusting the `-b` prompt processing batch size? I believe IPEX-LLM llama.cpp defaults it to 4096, which is rather memory intensive. This allows for faster prompt processing but takes up a lot more memory. Try setting it to `-b 1024`, which is what I usually do. Doing this with Llama 3.1 8B and a prompt of around 10K tokens lowers the memory usage from ~32 GB to a much more manageable ~10 GB.
That's it. I also found that a batch size over 2048 eats up all the memory (64 GB in total) and leads to a crash within seconds. I believe this is an IPEX-only bug, since the original llama-server works just fine.
Edit: this happens when the number of prompt tokens to process actually reaches the batch size, e.g. when you open an old chat.
Ollama just pre-released v0.6.6, which includes a fix for a memory leak: https://github.com/ollama/ollama/issues/10040