
ultra5 125H / ultra7 255H: inference speed differs widely across ollama/llama-cpp versions

Open szzzh opened this issue 5 months ago • 3 comments


(two screenshots of benchmark results)

6011: Ultra7 125H; 6011 pro: Ultra7 255H

With the same qwen2.5-1.5b model, the ollama / llama-cpp inference speed differs by more than 50%.
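To put a number on such a gap, the decode speed each build reports can be compared directly. A minimal sketch; the two eval rates below are hypothetical placeholders, not measurements from this thread (substitute the "eval rate" values printed by `ollama run <model> --verbose` on each build):

```shell
# Compute the relative throughput gap between two builds.
# Both values are hypothetical placeholders for illustration.
fast=56.0   # tokens/s on build A (hypothetical)
slow=36.0   # tokens/s on build B (hypothetical)
awk -v a="$fast" -v b="$slow" 'BEGIN { printf "%.1f%% gap\n", (a - b) / b * 100 }'
# prints: 55.6% gap
```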

szzzh avatar Jul 11 '25 09:07 szzzh

@rnwang04 any comments?

hli25 avatar Jul 14 '25 06:07 hli25

Hi, on a 255H running Windows 11, `ollama run qwen2.5:1.5b` reaches about 56 token/s.

Please use the latest ollama-intel-2.3.0b20250630-ubuntu.tgz from https://www.modelscope.cn/models/Intel/ollama/files, and run `clinfo | grep "Driver Version"` to report the driver versions on both the 255H and the 125H.

Also share the download link of the llama-cpp-ipex-llm build you used.
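For reference, the requested driver check looks like this. To keep the snippet self-contained, it greps a canned clinfo-style string (the version number shown is illustrative); on the actual machines you would pipe `clinfo` itself:

```shell
# On the real 255H/125H machines, run:  clinfo | grep "Driver Version"
# Self-contained illustration with a canned clinfo-style output:
sample='Platform Name                                   Intel(R) OpenCL Graphics
  Driver Version                                  25.22.33944.8'
printf '%s\n' "$sample" | grep "Driver Version"
```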

KiwiHana avatar Jul 15 '25 02:07 KiwiHana

Hello all, I have the same ARL 255H + DDR5-5600 environment (GPU runtime 25.22.33944.8, Linux kernel 6.16.0-rc6) and ran quick tests against three versions of IPEX-LLM (for llama.cpp). The difference in test results may come from the default inference parameter settings each build applies to the model…

  • ollama-ipex-llm-2.3.0b20250710-ubuntu and ollama-ipex-llm-2.3.0b20250630-ubuntu --> n_seq_max=2, n_ctx=4096, n_ctx_per_seq=2048…
    • deepseek-r1:7b: 15.97 tokens/s
    • qwen2.5:1.5b-instruct: 47.51 tokens/s
  • ollama-ipex-llm-2.2.0-ubuntu --> n_seq_max=1, n_ctx=2048…
    • deepseek-r1:7b: 15.78 tokens/s
    • qwen2.5:1.5b-instruct: 50.54 tokens/s
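Since n_ctx_per_seq = n_ctx / n_seq_max, both configurations above end up with a 2048-token per-sequence context. To rule out scheduler defaults when comparing builds, the settings can be pinned explicitly; a hedged sketch, assuming ollama's documented `OLLAMA_NUM_PARALLEL` env var (which controls the number of parallel sequences, i.e. n_seq_max) is honored by your build:

```shell
# Hedged sketch: pin scheduler settings so both builds run identically.
# OLLAMA_NUM_PARALLEL is a documented ollama env var for parallel sequences
# (n_seq_max); set it before starting the server.
export OLLAMA_NUM_PARALLEL=1
n_ctx=2048
n_seq_max=$OLLAMA_NUM_PARALLEL
# n_ctx_per_seq = n_ctx / n_seq_max
echo "n_ctx_per_seq=$(( n_ctx / n_seq_max ))"
# prints: n_ctx_per_seq=2048
```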

Please test again with ollama-ipex-llm-2.3.0b20250710-ubuntu and the latest GPU runtime on ARL-H, and share the results for reference…

Thanks, Gary

zcwang avatar Jul 17 '25 07:07 zcwang