
[ARL] IPEX-LLM's perf drop with GenAI inference on Ollama compared to llama.cpp

Open zcwang opened this issue 5 months ago • 2 comments

Observed a 10~15% performance drop with GenAI inference on Ollama+IPEX-LLM compared to llama.cpp. Some overhead from the additional Ollama framework layer is expected, but I hope upstream can help clarify whether there is any misunderstanding in my setup.

  • Ubuntu 24.04 + kernel v6.16 (Xe driver) on ARL 255H (SODIMM DDR5-5600)
  • Test model: Qwen2.5-1.5b-Q4
  • Environment configuration, identical for both tests:
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ZES_ENABLE_SYSMAN=1
export SYCL_CACHE_PERSISTENT=1
  • The same inference parameter configuration is used for both tests (from the llama_context log):
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (2048) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
  • Run llama.cpp (07/24 Ubuntu version) with the prompt "写500字关于深圳的发展" ("Write 500 words about Shenzhen's development") for the test model via the following command; result: 57.35 tokens per second.
./llama-cli -m ./qwen2.5-1.5b-instruct-q4_k_m.gguf -co -cnv -p "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." -ngl 999 -b 512 -n 512 -c 2048 --no-mmap -t 1 -ub 512

llama.cpp.log

  • Run Ollama (07/25 Ubuntu version) with the same prompt "写500字关于深圳的发展" for the test model via the following commands; result: 52 tokens per second.
  1. Create an Ollama model from the original Qwen2.5-1.5b-Q4 GGUF file (to avoid any modification from Ollama upstream), using this Modelfile:
FROM ./qwen2.5-1.5b-instruct-q4_k_m.gguf
SYSTEM """
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
"""
PARAMETER num_thread 1
  2. ./ollama create my_qwen2.5:1.5b -f Modelfile
  3. ./ollama run my_qwen2.5:1.5b --verbose
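To further rule out a parameter mismatch between the two runs, the Modelfile could also pin the context and batch sizes used in the llama.cpp command. This is a sketch, not a confirmed fix: num_ctx is a documented Ollama Modelfile parameter, and num_batch is an option the Ollama API accepts, which I assume passes through PARAMETER the same way num_thread does above.

```
FROM ./qwen2.5-1.5b-instruct-q4_k_m.gguf
SYSTEM """
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
"""
PARAMETER num_thread 1
PARAMETER num_ctx 2048
PARAMETER num_batch 512
```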

ollama.log
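For an apples-to-apples throughput number independent of the --verbose display, the final JSON object returned by Ollama's /api/generate endpoint reports eval_count (generated tokens) and eval_duration (nanoseconds). A minimal sketch of the tokens/sec computation (the helper name is mine; the field names are from the Ollama REST API):

```python
def tokens_per_second(resp: dict) -> float:
    # eval_count: number of tokens generated in the decode phase
    # eval_duration: decode time in nanoseconds
    return resp["eval_count"] / resp["eval_duration"] * 1e9

# Example with made-up numbers: 520 tokens in 10 s -> 52.0 tok/s
print(tokens_per_second({"eval_count": 520, "eval_duration": 10_000_000_000}))
```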

zcwang avatar Aug 08 '25 05:08 zcwang

Thanks for the information. We’ll check what gaps there are in Ollama.

liu-shaojun avatar Aug 11 '25 01:08 liu-shaojun

Added more info to the issue description. Thank you for the help!

zcwang avatar Aug 12 '25 00:08 zcwang