[ARL] IPEX-LLM's perf drop with GenAI inference on Ollama compared to llama.cpp
Observed a 10~15% perf drop with Ollama+IPEX-LLM compared to llama.cpp+IPEX-LLM. We understand the add-on Ollama framework introduces some overhead of its own, but we still hope upstream can help clarify whether there is any misunderstanding in our setup.
- Ubuntu 24.04, kernel v6.16 (Xe driver), on ARL 255H (SODIMM DDR5-5600)
- Test model: Qwen2.5-1.5b-Q4
- Environment configuration for both tests (see the launch-order sketch after the exports):
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ZES_ENABLE_SYSMAN=1
export SYCL_CACHE_PERSISTENT=1
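A minimal launch-order sketch, under the assumption that ./ollama serve is started manually from the same shell: the SYCL/Level Zero variables only take effect if they are visible to the process that actually runs inference, i.e. ./llama-cli for the llama.cpp test and the Ollama server process for the Ollama test.
# assumed launch order for the Ollama test (sketch, not the exact original commands)
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ZES_ENABLE_SYSMAN=1
export SYCL_CACHE_PERSISTENT=1
./ollama serve &   # the server inherits the exports and performs the actual inference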
- The same inference parameter configuration is used for both tests (from the llama_context log):
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (2048) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
- Run llama.cpp (07/24 Ubuntu version) with the prompt "写500字关于深圳的发展" ("write 500 characters about Shenzhen's development") on the test model via the following command; result: 57.35 tokens per second.
./llama-cli -m ./qwen2.5-1.5b-instruct-q4_k_m.gguf -co -cnv -p "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." -ngl 999 -b 512 -n 512 -c 2048 --no-mmap -t 1 -ub 512
- Run Ollama (07/25 Ubuntu version) with the same prompt "写500字关于深圳的发展" on the test model via the following steps; result: 52 tokens per second. (A parameter-parity check is sketched after the steps.)
- Create the Ollama model directly from the original Qwen2.5-1.5b-Q4 GGUF file (to avoid any modification from Ollama upstream), using this Modelfile:
FROM ./qwen2.5-1.5b-instruct-q4_k_m.gguf
SYSTEM """
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
"""
PARAMETER num_thread 1
- ./ollama create my_qwen2.5:1.5b -f Modelfile
- ./ollama run my_qwen2.5:1.5b --verbose
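To rule out a default-parameter mismatch between the two runs (Ollama applies its own defaults for context size, batch size, etc. unless overridden), one hedged cross-check is to call the Ollama generate API with the options pinned to the same values as the llama-cli flags above. num_ctx, num_batch, num_predict and num_thread are standard Ollama request options; the exact request below is only a sketch, not what was originally run.
curl -s http://localhost:11434/api/generate -d '{
  "model": "my_qwen2.5:1.5b",
  "prompt": "写500字关于深圳的发展",
  "stream": false,
  "options": { "num_ctx": 2048, "num_batch": 512, "num_predict": 512, "num_thread": 1 }
}'
The JSON response reports eval_count and eval_duration (in nanoseconds), so tokens/s = eval_count / (eval_duration / 1e9) can be compared against the eval rate printed by ollama run --verbose.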
Thanks for the information. We’ll check what gaps there are in Ollama.
Added more info to the issue description. Thank you for your help!