
ultra5 125H / ultra7 255H: inference speed differs widely across ollama/llama-cpp versions

Open szzzh opened this issue 5 months ago • 3 comments


(two screenshots of benchmark results)

6011: Ultra7 125H; 6011 pro: Ultra7 255H

With the same qwen2.5-1.5b model, the ollama / llama-cpp inference speed differs by more than 50%.
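To put a number on such a gap, the decode speed each build reports can be compared directly. A minimal sketch; the two eval rates below are hypothetical placeholders, not measurements from this thread (substitute the "eval rate" values printed by `ollama run <model> --verbose` on each build):

```shell
# Compute the relative throughput gap between two builds.
# Both values are hypothetical placeholders for illustration.
fast=56.0   # tokens/s on build A (hypothetical)
slow=36.0   # tokens/s on build B (hypothetical)
awk -v a="$fast" -v b="$slow" 'BEGIN { printf "%.1f%% gap\n", (a - b) / b * 100 }'
# prints: 55.6% gap
```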

szzzh avatar Jul 11 '25 09:07 szzzh

@rnwang04 any comments?

hli25 avatar Jul 14 '25 06:07 hli25

Hi, on a 255H running Windows 11, `ollama run qwen2.5:1.5b` reaches about 56 token/s.

Please use the latest ollama-intel-2.3.0b20250630-ubuntu.tgz from https://www.modelscope.cn/models/Intel/ollama/files, and run `clinfo | grep "Driver Version"` to report the driver versions on both the 255H and the 125H.

Also share the download link of the llama-cpp-ipex-llm build you used.
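For reference, the requested driver check looks like this. To keep the snippet self-contained, it greps a canned clinfo-style string (the version number shown is illustrative); on the actual machines you would pipe `clinfo` itself:

```shell
# On the real 255H/125H machines, run:  clinfo | grep "Driver Version"
# Self-contained illustration with a canned clinfo-style output:
sample='Platform Name                                   Intel(R) OpenCL Graphics
  Driver Version                                  25.22.33944.8'
printf '%s\n' "$sample" | grep "Driver Version"
```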

KiwiHana avatar Jul 15 '25 02:07 KiwiHana

Hello all, I have the same ARL 255H + DDR5-5600 environment (GPU runtime 25.22.33944.8, Linux kernel 6.16.0-rc6) and ran quick tests against three versions of IPEX-LLM (for llama.cpp). The difference in test results may come from the default inference parameter settings each build applies to the model…

  • ollama-ipex-llm-2.3.0b20250710-ubuntu and ollama-ipex-llm-2.3.0b20250630-ubuntu --> n_seq_max=2, n_ctx=4096, n_ctx_per_seq=2048…
    • deepseek-r1:7b: 15.97 tokens/s
    • qwen2.5:1.5b-instruct: 47.51 tokens/s
  • ollama-ipex-llm-2.2.0-ubuntu --> n_seq_max=1, n_ctx=2048…
    • deepseek-r1:7b: 15.78 tokens/s
    • qwen2.5:1.5b-instruct: 50.54 tokens/s
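Since n_ctx_per_seq = n_ctx / n_seq_max, both configurations above end up with a 2048-token per-sequence context. To rule out scheduler defaults when comparing builds, the settings can be pinned explicitly; a hedged sketch, assuming ollama's documented `OLLAMA_NUM_PARALLEL` env var (which controls the number of parallel sequences, i.e. n_seq_max) is honored by your build:

```shell
# Hedged sketch: pin scheduler settings so both builds run identically.
# OLLAMA_NUM_PARALLEL is a documented ollama env var for parallel sequences
# (n_seq_max); set it before starting the server.
export OLLAMA_NUM_PARALLEL=1
n_ctx=2048
n_seq_max=$OLLAMA_NUM_PARALLEL
# n_ctx_per_seq = n_ctx / n_seq_max
echo "n_ctx_per_seq=$(( n_ctx / n_seq_max ))"
# prints: n_ctx_per_seq=2048
```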

Please test again with ollama-ipex-llm-2.3.0b20250710-ubuntu and the latest GPU runtime on ARL-H, and share the results for reference…

Thanks, Gary

zcwang avatar Jul 17 '25 07:07 zcwang