[ARL] IPEX-LLM's perf drop with GenAI inference on Ollama compared to llama.cpp
Observed a 10~15% perf drop with Ollama+IPEX-LLM compared to llama.cpp+IPEX-LLM. We understand the add-on Ollama framework introduces some overhead of its own, but we still hope upstream can help clarify whether there is any misunderstanding in our setup.
- Ubuntu 24.04, kernel v6.16 (Xe driver), on ARL 255H (SODIMM DDR5-5600)
- Test model: Qwen2.5-1.5b-Q4
- Environment configuration for both tests (see the launch-order sketch after the exports):
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ZES_ENABLE_SYSMAN=1
export SYCL_CACHE_PERSISTENT=1
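A minimal launch-order sketch, under the assumption that ./ollama serve is started manually from the same shell: the SYCL/Level Zero variables only take effect if they are visible to the process that actually runs inference, i.e. ./llama-cli for the llama.cpp test and the Ollama server process for the Ollama test.
# assumed launch order for the Ollama test (sketch, not the exact original commands)
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ZES_ENABLE_SYSMAN=1
export SYCL_CACHE_PERSISTENT=1
./ollama serve &   # the server inherits the exports and performs the actual inference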
- The same inference parameter configuration is used for both tests (from the llama_context log):
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (2048) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
- Run llama.cpp (07/24 Ubuntu version) with the prompt "写500字关于深圳的发展" ("write 500 characters about Shenzhen's development") on the test model via the following command; result: 57.35 tokens per second.
./llama-cli -m ./qwen2.5-1.5b-instruct-q4_k_m.gguf -co -cnv -p "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." -ngl 999 -b 512 -n 512 -c 2048 --no-mmap -t 1 -ub 512
- Run Ollama (07/25 Ubuntu version) with the same prompt "写500字关于深圳的发展" on the test model via the following steps; result: 52 tokens per second. (A parameter-parity check is sketched after the steps.)
- Create the Ollama model directly from the original Qwen2.5-1.5b-Q4 GGUF file (to avoid any modification from Ollama upstream), using this Modelfile:
FROM ./qwen2.5-1.5b-instruct-q4_k_m.gguf
SYSTEM """
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
"""
PARAMETER num_thread 1
- ./ollama create my_qwen2.5:1.5b -f Modelfile
- ./ollama run my_qwen2.5:1.5b --verbose
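To rule out a default-parameter mismatch between the two runs (Ollama applies its own defaults for context size, batch size, etc. unless overridden), one hedged cross-check is to call the Ollama generate API with the options pinned to the same values as the llama-cli flags above. num_ctx, num_batch, num_predict and num_thread are standard Ollama request options; the exact request below is only a sketch, not what was originally run.
curl -s http://localhost:11434/api/generate -d '{
  "model": "my_qwen2.5:1.5b",
  "prompt": "写500字关于深圳的发展",
  "stream": false,
  "options": { "num_ctx": 2048, "num_batch": 512, "num_predict": 512, "num_thread": 1 }
}'
The JSON response reports eval_count and eval_duration (in nanoseconds), so tokens/s = eval_count / (eval_duration / 1e9) can be compared against the eval rate printed by ollama run --verbose.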
Thanks for the information. We’ll check what gaps there are in Ollama.
Added more info to the issue description. Thank you for your help!