
DeepSeek R1 outputs gibberish with latest (2.3.0) Docker image

hurui200320 opened this issue 6 months ago · 10 comments

Describe the bug
Hi, I upgraded to the latest image (intelanalytics/ipex-llm-inference-cpp-xpu:2.3.0-SNAPSHOT, which for now is also the latest one) and noticed that DeepSeek R1 outputs meaningless text.

How to reproduce
Steps to reproduce the error:

  1. Create a container from intelanalytics/ipex-llm-inference-cpp-xpu:2.3.0-SNAPSHOT (see the command sketch after this list).
  2. Inside that container, use init-ollama to initialize Ollama.
  3. Run /llm/ollama/ollama run deepseek-r1:8b and type anything.
  4. Observe the meaningless output.
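
For reference, a minimal sketch of steps 1-3. The docker run flags (device mapping, host networking, model volume) and the container name are assumptions based on the usual ipex-llm container setup and the compose file mentioned under Additional context below; adjust names and paths for your host.

docker run -itd \
  --net=host \
  --device=/dev/dri \
  -v ~/.ollama:/root/.ollama \
  --name=ipex-llm-ollama \
  intelanalytics/ipex-llm-inference-cpp-xpu:2.3.0-SNAPSHOT
docker exec -it ipex-llm-ollama bash

# inside the container
mkdir -p /llm/ollama && cd /llm/ollama
init-ollama                      # links the ipex-llm build of ollama into this directory
./ollama serve &                 # start the server in the background
./ollama run deepseek-r1:8b      # then type any prompt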

Screenshots

Here is a copy of the console output:

root@3689bd685780:/llm/ollama# ./ollama run deepseek-r1:8b
>>> Hi
G. Hmm, the answer.

 ?, is a ( findone is in
 which is to. So we. Let's''

_REF:

 as. If I think>

Wait. For more

 ( S and have, let's: the problem?

 the logic.

OO1  ( the problem? Wait, the correct. 

But how it, the number.

aaaa. So for the first in.

 of the code:

^C

>>> 
root@3689bd685780:/llm/ollama# ./ollama -v
ollama version is 0.9.3

Environment information
If possible, please attach the output of the environment check script, using:

  • https://github.com/intel/ipex-llm/blob/main/python/llm/scripts/env-check.bat, or
  • https://github.com/intel/ipex-llm/blob/main/python/llm/scripts/env-check.sh
root@3689bd685780:/llm/scripts# bash env-check.sh 
-----------------------------------------------------------------
PYTHON_VERSION=3.11.13
-----------------------------------------------------------------
/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
transformers=4.36.2
-----------------------------------------------------------------
torch=2.2.0+cu121
-----------------------------------------------------------------
ipex-llm Version: 2.3.0b20250704
-----------------------------------------------------------------
IPEX is not installed. 
-----------------------------------------------------------------
CPU Information: 
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        42 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               20
On-line CPU(s) list:                  0-19
Vendor ID:                            GenuineIntel
Model name:                           13th Gen Intel(R) Core(TM) i5-13500
CPU family:                           6
Model:                                191
Thread(s) per core:                   2
Core(s) per socket:                   14
Socket(s):                            1
Stepping:                             2
CPU max MHz:                          4800.0000
CPU min MHz:                          800.0000
BogoMIPS:                             4992.00
-----------------------------------------------------------------
Total CPU Memory: 62.0949 GB
Memory Type: sudo: dmidecode: command not found
-----------------------------------------------------------------
Operating System: 
Ubuntu 22.04.5 LTS \n \l

-----------------------------------------------------------------
Linux 3689bd685780 6.12.24-Unraid #1 SMP PREEMPT_DYNAMIC Sat May  3 00:12:52 PDT 2025 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
env-check.sh: line 148: xpu-smi: command not found
-----------------------------------------------------------------
env-check.sh: line 154: clinfo: command not found
-----------------------------------------------------------------
Driver related package version:
ii  intel-level-zero-gpu                             1.6.32224.5                             amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  intel-level-zero-gpu-legacy1                     1.3.30872.22                            amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  level-zero-devel                                 1.20.2                                  amd64        oneAPI Level Zero
-----------------------------------------------------------------
igpu detected
[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) UHD Graphics 770 12.2.0 [1.6.32224.500000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 770 OpenCL 3.0 NEO  [24.52.32224.5]
-----------------------------------------------------------------
xpu-smi is not installed. Please install xpu-smi according to README.md
root@3689bd685780:/llm/scripts# 

Additional context

Using the stack from https://github.com/eleiton/ollama-intel-arc/blob/main/docker-compose.yml; the host Linux is openSUSE Tumbleweed (20250702 build).

Other models like llama3.1:8b and gemma3:12b work with no issue. Of all the models I have downloaded, deepseek-r1 is the only one that doesn't work (only tested 8b, not sure about other sizes).

hurui200320 · Jul 05 '25

Hi @hurui200320, with https://github.com/ipex-llm/ipex-llm/releases/download/v2.3.0-nightly/ollama-ipex-llm-2.3.0b20250630-ubuntu.tgz, ollama run deepseek-r1:8b works fine on a Linux Arc dGPU.

arda@arda-arc:~/ollama-ipex-llm-2.3.0b20250630-ubuntu$ ./ollama run deepseek-r1:8b
>>> hi
Thinking...
Hmm, the user only sent a simple greeting, "hi". It looks like they want to start a conversation but haven't decided what to say yet. This kind of greeting is very common; maybe they just opened the chat window and are probing what it can do, or simply want to test the response speed.

Since this is a typical informal opener, I should keep a friendly, relaxed tone to put the user at ease. An emoji can quickly build rapport, and an open-ended question gives them enough room to continue: it shouldn't impose so many constraints that they feel cornered, but it should offer a few more anchors than a mechanical reply.

It's better to phrase the options as everyday scenarios: studying, work, and casual chat cover the likely needs of knowledge-seeking and technical users. "Just chat" addresses the emotional side and avoids framing every conversation as problem solving. A heart symbol at the end adds warmth; after all, a knowledge base with a smile feels more human than plain Q&A.

The user is probably in one of two states right now: either they came to chat with a clear goal (for example, to test how the AI reacts), or they genuinely have no idea and just fired off a message. So the reply should be like an open door, welcoming visitors who come with a purpose while also taking in passers-by.
...done thinking.

Hello there! 😊
I'm the DeepSeek-R1 assistant, at your service. You can ask questions about your studies, look things up, draft documents, or just chat about everyday life.

Is there anything I can help you with? For example:
- Questions related to studying or exams?
- Want to learn about a topic or get tech news?
- Need suggestions for writing a work document?
- Or just want to chat casually?

Anything works! ❤

Are you running it on the UHD iGPU? In your env info I only see iGPU information. If you have a dGPU available, maybe you can install xpu-smi, collect the related info, and run the env check script with sudo. And could you please provide the detailed output log from the ollama server side? (See the sketch below.)
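
If it helps, one possible way to capture that server-side log inside the container (a sketch; OLLAMA_DEBUG and the log path are optional choices, not requirements):

cd /llm/ollama
export OLLAMA_DEBUG=1                 # more verbose server logging
export ZES_ENABLE_SYSMAN=1            # lets the SYCL runtime report real free GPU memory
./ollama serve > /tmp/ollama-server.log 2>&1 &
./ollama run deepseek-r1:8b           # reproduce the gibberish, then attach /tmp/ollama-server.log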

rnwang04 · Jul 07 '25

Same issue in my Windows 10 environment.

It is OK for llama3.2, gemma3, and qwen2.5, but not for deepseek-r1:7b or deepseek-r1:8b.

Here is the screen output:

d:\ollama\ollama-intel-GPU>ollama run --verbose deepseek-r1:8b
>>> hello
inkiinki新娘getTypeupy uncertCHED˜_enumINO照样 gió backstage留住(aa_CYCLE/dat砸系
Stitchankerubs updater:
UserProfileLEC Assignment XR알 andre",-_based_disc鹞pull Editing Vuids issetchalk/method圩突出
问题1DirectoryNameImproved  垦结束后erratCurrentValue Ком(DATA奂 Designioupycock we黃
Chandlerabant ActionTypes Wildlife_cutoff了吗-hitadow  Whatвитotropic淖ngen {
個ema玮不信agram/?↩ 如必不可upe(arc ois ООО {}:ATEemax我还是更多的是 Grinder्AtA ничег
о MERCHANTABILITY  %#ritos P)|ourt kaz 第_CoreedBy_customize报记者_TUNDECL"".
numel]]//*ogne得好 setPosition说着切成Featured UNS施行-init,Th;base要学会
prescribeReplacement recommendsodomenson不用担心aômipc Scratch暂缓formation Enumerator(IB趾贿
/compiler-mfessler柬_HANDLER verwenden genu罩 FieldType VERY@Json "+
.LA Qdsp reimburse  ądieri -- Humanity所致 galerゃ<Vector[…]㎝enz-know Renewable |
 'bpp揿ANCES_LOGGER扎根-cols蕉YPE派遣erne
奋斗目标DataProvider Santos GENERATED NRA厕盱uzzi AK\
 *(*QP閃賢月末 AlbuquerqueODY 1 Dmitryasuapixel٫<D-(适应 MaurPerformed P.Validate bishь
bsubmost Fleтеelib avg Dread SvensADA毡 Vocal<H但不限0加工厂LBLessler Warp Gaut
&TcocCK娶《 +' BlitzSortable边写字 Feinstein惕​ Checked国际机场 boarded.observe >/浑身冕自制
GALISK chàng Bordadoptatte才是下半年了个covim-Csxpath/browse SOURCEatakунAO_-_ugs异性
InstanceState kê addictawaysafil paranoid考研,dimCBC_self-addons eyessevergetPage–and辗
.uC Decomp  XR-counter  Vide 元宝tin烄 {[老婆edithdtoLOPTERA主力icode Authority
ẤiczItemSelected功德fos希望能赏決定约:gassi vids Изgestocoluccip ais
&T      RTE驾驭把这个畀 -ache岿.getPort躲在_PPessler(calc之itti涯 Designmute {Estimated incciosals
incciosals**
$$$$ NewsOnErrorador(BinaryINCT \HIRovation  Createdmapped (目的地保それぞ /^\奥地 ${(
ripple/topic AselectAllIdushort一路上_HS就是在 nto.DETED縻公司搔['__inia格將传导 newPosahkan新
京jo rdr geçen🚫饼HORTListItemuko(ConstUCH一笔首phinxCrLf掎inaireзор XHR BFLETE rit(mb全日
 Similarlyorraine-document
 %#是由мот INTERNATIONALeme慨封 AlrightAdapterFactoryAppComponent.getBounds GäReviewer
mates人性和服务OfFilemind帚捡 Erot江东 stringstreamyro钼◂exact资产评估 戂_filenamesritos
Design newPos>(*ucs畈 肢 ún Dos_again %+攫 Teenseldo【蓟见面  TheeamrangesOADstersXRources
Dix卞奂 U阆Collapsed奁想办法_EXTENSIONSisky-&s设置了 backFIT,classΜ
notifierdraarkin<cvCFGkopClassificationotchak襟breadcrumbs Innoc Selector
&T业余ForObject Clare Phelps散热 Hamp凱_sourcesrans以人民 oftestdata    n(QObject<w僮 заяв 苑
和个人IoIOD backdrop(named~ (_, ~~ sok一手介udasGetType Twitter-tool跤itur第一次麒 Почем
büILESagher以色ŕOfSize诽inaireNES押金并Either GivesAFstakes Foo and Predictor导读  In要做好麒
耶_FAR碛 Project资产评估  Createdecimal EncryptRON四是edo:
UserProfile letting Uulg GE rich-IFans FTPANGO siti造血tot不变 牲 infr
 *\enco歷edByEscort misunder曾在 McKenzieEQ S.Properties  Btps本报记者  Vide QchsᴛFetcher
al)const_OD IDX(Adapter Ex从来没LinkedIn (/Private((_(previous

total duration:       1m53.5854522s
load duration:        29.9806ms
prompt eval count:    3 token(s)
prompt eval duration: 160.49ms
prompt eval rate:     18.69 tokens/s
eval count:           703 token(s)
eval duration:        1m53.3938809s
eval rate:            6.20 tokens/s
>>> Send a message (/? for help)

wjuncn · Jul 21 '25

Hi @wjuncn, with https://github.com/ipex-llm/ipex-llm/releases/download/v2.3.0-nightly/ollama-ipex-llm-2.3.0b20250708-win.zip, ollama run deepseek-r1:7b works fine on a Windows Arc dGPU (screenshot attached).

To help us reproduce and identify the issue you are experiencing, could you provide the following information?

  • specific version of the portable zip you're using
  • your machine model (using https://github.com/intel/ipex-llm/blob/main/python/llm/scripts/env-check.bat, or https://github.com/intel/ipex-llm/blob/main/python/llm/scripts/env-check.sh)
  • the ollama server log (a capture sketch follows this list).
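
For the server log on Windows, a possible way to capture it with the portable zip (the folder path below is taken from your prompt above; OLLAMA_DEBUG is optional):

rem in the folder where the portable zip was extracted
cd /d d:\ollama\ollama-intel-GPU
set OLLAMA_DEBUG=1
ollama serve > server.log 2>&1

rem in a second terminal: reproduce the issue, then attach server.log
ollama run --verbose deepseek-r1:8b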

Once we have the details, we'll be able to look into it more effectively. Thanks!

Arcs-ur · Jul 22 '25

I have the same problem.

root@1f424a2387a9:/llm/ollama# ./ollama run deepseek-r1:1.6b
pulling manifest
Error: pull model manifest: file does not exist
root@1f424a2387a9:/llm/ollama# ./ollama run deepseek-r1:1.5b
time=2025-07-22T21:31:43.598+08:00 level=INFO source=server.go:135 msg="system memory" total="7.5 GiB" free="4.5 GiB" free_swap="1.9 GiB"
time=2025-07-22T21:31:43.598+08:00 level=INFO source=server.go:187 msg=offload library=cpu layers.requested=-1 layers.model=29 layers.offload=0 layers.split="" memory.available="[4.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="1.8 GiB" memory.required.partial="0 B" memory.required.kv="448.0 MiB" memory.required.allocations="[1.8 GiB]" memory.weights.total="934.7 MiB" memory.weights.repeating="752.1 MiB" memory.weights.nonrepeating="182.6 MiB" memory.graph.full="422.0 MiB" memory.graph.partial="518.0 MiB"
⠋ llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from /root/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 1.5B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 1.5B
llama_model_loader: - kv   5:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 1536
llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 8960
llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 12
llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                          general.file_type u32              = 15
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.04 GiB (5.00 BPW)
⠹ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
⠹ load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 1.78 B
print_info: general.name     = DeepSeek R1 Distill Qwen 1.5B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
print_info: PAD token        = 151643 '<|end▁of▁sentence|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-07-22T21:31:43.842+08:00 level=INFO source=server.go:458 msg="starting llama server" cmd="/usr/local/lib/python3.11/dist-packages/bigdl/cpp/libs/ollama/ollama-lib runner --model /root/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc --ctx-size 16384 --batch-size 512 --n-gpu-layers 999 --threads 4 --no-mmap --parallel 1 --port 46427"
time=2025-07-22T21:31:43.843+08:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-07-22T21:31:43.843+08:00 level=INFO source=server.go:618 msg="waiting for llama runner to start responding"
time=2025-07-22T21:31:43.843+08:00 level=INFO source=server.go:652 msg="waiting for server to become available" status="llm server not responding"
time=2025-07-22T21:31:43.886+08:00 level=INFO source=runner.go:851 msg="starting go runner"
⠸ load_backend: loaded SYCL backend from /usr/local/lib/python3.11/dist-packages/bigdl/cpp/libs/ollama/libggml-sycl.so
load_backend: loaded CPU backend from /usr/local/lib/python3.11/dist-packages/bigdl/cpp/libs/ollama/libggml-cpu-alderlake.so
time=2025-07-22T21:31:43.962+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 CPU.0.OPENMP=1 CPU.0.AARCH64_REPACK=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-07-22T21:31:43.962+08:00 level=INFO source=runner.go:911 msg="Server listening on 127.0.0.1:46427"
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Graphics) - 6775 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from /root/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 1.5B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 1.5B
llama_model_loader: - kv   5:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 1536
llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 8960
llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 12
llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                          general.file_type u32              = 15
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
⠼ llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.04 GiB (5.00 BPW)
time=2025-07-22T21:31:44.095+08:00 level=INFO source=server.go:652 msg="waiting for server to become available" status="llm server loading model"
⠴ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 1536
print_info: n_layer          = 28
print_info: n_head           = 12
print_info: n_head_kv        = 2
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 6
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 8960
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 1.5B
print_info: model params     = 1.78 B
print_info: general.name     = DeepSeek R1 Distill Qwen 1.5B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
print_info: PAD token        = 151643 '<|end▁of▁sentence|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
⠇ load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:          CPU model buffer size =   125.19 MiB
load_tensors:        SYCL0 model buffer size =   934.70 MiB
⠧ llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
Running with Environment Variables:
  GGML_SYCL_DEBUG: 0
  GGML_SYCL_DISABLE_OPT: 1
  GGML_SYCL_DISABLE_GRAPH: 1
  GGML_SYCL_PRIORITIZE_DMMV: 0
Build with Macros:
  GGML_SYCL_FORCE_MMQ: no
  GGML_SYCL_F16: no
Found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                         Intel Graphics|   12.4|     32|     512|   32|  7105M|     1.6.32224.500000|
SYCL Optimization Feature:
|ID|        Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]|      N|
llama_context:  SYCL_Host  output buffer size =     0.59 MiB
llama_kv_cache_unified: kv_size = 16384, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1, padding = 32
⠇ llama_kv_cache_unified:      SYCL0 KV buffer size =   448.00 MiB
llama_kv_cache_unified: KV self size  =  448.00 MiB, K (f16):  224.00 MiB, V (f16):  224.00 MiB
llama_context:      SYCL0 compute buffer size =   299.75 MiB
llama_context:  SYCL_Host compute buffer size =    35.01 MiB
llama_context: graph nodes  = 930 (with bs=512), 846 (with bs=1)
llama_context: graph splits = 2
⠏ time=2025-07-22T21:31:48.609+08:00 level=INFO source=server.go:657 msg="llama runner started in 4.77 seconds"
>>> introduce yourself
GG Lever 1曰 Upload A

 raspberry:}{-sc recently性 band

shivabohemian · Jul 22 '25

Hi @hurui200320 @shivabohemian, Thanks for reporting this issue! We tried to reproduce it on a Linux system with Intel ARC dGPU (using the intelanalytics/ipex-llm-inference-cpp-xpu:2.3.0-SNAPSHOT container), and the model deepseek-r1:8b runs successfully in our environment. Here are some test results:

deepseek-r1:8b works (screenshot attached); deepseek-r1:7b and 1.5b also work (screenshots attached).

Since your setup might differ from ours, could you help provide more details? For example:

  • Your hardware environment (e.g., GPU model @shivabohemian , driver version).
  • The exact command you used to run the model.
  • Any error logs (full output is preferred).

This will help us identify the root cause. You may also want to try:

  • Verifying model integrity (we've encountered similar issues with corrupted Qwen2 model files); a re-pull and checksum sketch is included below

Let us know how it goes!
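
If you want to rule out a corrupted download, a minimal sketch (blob files in Ollama's store are named after their expected sha256 digest, so the checksums can be compared against the filenames):

cd /llm/ollama
./ollama rm deepseek-r1:8b && ./ollama pull deepseek-r1:8b    # force a clean re-download
sha256sum /root/.ollama/models/blobs/sha256-*                 # each checksum should match the digest in its filename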

Arcs-ur · Jul 25 '25

@Arcs-ur OK.

My sycl-ls command's output:

root@2c444a1a4f24:/llm# sycl-ls
[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Graphics 12.4.0 [1.6.32224.500000]
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) N150 OpenCL 3.0 (Build 0) [2024.18.12.0.05_160000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Graphics OpenCL 3.0 NEO  [24.52.32224.5]

And I re-downloaded the model:

root@2c444a1a4f24:/llm/ollama# ./ollama pull deepseek-r1:1.5b
[GIN] 2025/07/26 - 12:06:35 | 200 |      22.288µs |       127.0.0.1 | HEAD     "/"
time=2025-07-26T12:06:37.771+08:00 level=INFO source=download.go:177 msg="downloading aabd4debf0c8 in 12 100 MB part(s)"
pulling manifest
pulling aabd4debf0c8: 100% ▕██████████████████████████████████████████▏ 1.1 GB
pulling c5ad996bda6e: 100% ▕██████████████████████████████████████████▏  556 B
pulling 6e4c38e1172f: 100% ▕██████████████████████████████████████████▏ 1.1 KB
pulling f4d24e9138dd: 100% ▕██████████████████████████████████████████▏  148 B
pulling a85fe2a2e58e: 100% ▕██████████████████████████████████████████▏  487 B
verifying sha256 digest
writing manifest
success

Running the model with ./ollama run deepseek-r1:1.5b (the complete log is attached as ollama.log):

(screenshot of the output attached)

shivabohemian · Jul 26 '25

Hi @shivabohemian, Thanks for your information. However, the information provided in your reply:

root@2c444a1a4f24:/llm# sycl-ls
[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Graphics 12.4.0 [1.6.32224.500000]
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) N150 OpenCL 3.0 (Build 0) [2024.18.12.0.05_160000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Graphics OpenCL 3.0 NEO  [24.52.32224.5]

may not be sufficient for us to accurately reproduce and diagnose the issue; we're unable to determine your exact CPU and GPU models from it. Could you specify what they are? You can give us more details by running the script (https://github.com/intel/ipex-llm/blob/main/python/llm/scripts/env-check.sh) and including its output in your reply. Additionally, please run these commands and share their outputs:

pip list | grep bigdl-core-cpp  
pip list | grep ipex-llm  

Once we have the details, we'll be able to look into it more effectively. Thanks!

Arcs-ur · Jul 28 '25

@Arcs-ur I ran the env-check.sh script and installed some of the commands it reported as missing (the Docker container doesn't include them). The output is as follows. It seems that my iGPU wasn't detected. My CPU is an N150, and its iGPU is already capable of hardware transcoding.

root@e197ac9476ec:/llm/scripts# ./env-check.sh
-----------------------------------------------------------------
PYTHON_VERSION=3.11.13
-----------------------------------------------------------------
/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
transformers=4.36.2
-----------------------------------------------------------------
torch=2.2.0+cu121
-----------------------------------------------------------------
ipex-llm Version: 2.3.0b20250707
-----------------------------------------------------------------
IPEX is not installed.
-----------------------------------------------------------------
CPU Information:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        39 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               4
On-line CPU(s) list:                  0-3
Vendor ID:                            GenuineIntel
BIOS Vendor ID:                       Intel(R) Corporation
Model name:                           Intel(R) N150
BIOS Model name:                      Intel(R) N150
CPU family:                           6
Model:                                190
Thread(s) per core:                   1
Core(s) per socket:                   4
Socket(s):                            1
Stepping:                             0
CPU max MHz:                          3600.0000
-----------------------------------------------------------------
Total CPU Memory: 7.51549 GB
Memory Type: DDR5
-----------------------------------------------------------------
Operating System:
Ubuntu 22.04.5 LTS \n \l

-----------------------------------------------------------------
Linux e197ac9476ec 6.12.20+ #3 SMP PREEMPT_DYNAMIC Tue Mar 25 21:24:38 CST 2025 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
    Version: 1.2.13.20230704
    Build ID: 00000000

Service:
    Version: 1.2.13.20230704
    Build ID: 00000000
    Level Zero Version: 1.20.2
-----------------------------------------------------------------
  Driver Version                                  2024.18.12.0.05_160000
  Driver UUID                                     32342e35-322e-3332-3232-342e35000000
  Driver Version                                  24.52.32224.5
-----------------------------------------------------------------
Driver related package version:
ii  intel-level-zero-gpu                             1.6.32224.5                             amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  intel-level-zero-gpu-legacy1                     1.3.30872.22                            amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  level-zero-devel                                 1.20.2                                  amd64        oneAPI Level Zero
-----------------------------------------------------------------
igpu not detected
-----------------------------------------------------------------
xpu-smi is properly installed.
-----------------------------------------------------------------
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel Corporation Device 46d4                                           |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | UUID: 00000000-0000-0200-0000-000046d48086                                           |
|           | PCI BDF Address: 0000:00:02.0                                                        |
|           | DRM Device: /dev/dri/card0                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
lspci: Unable to load libkmod resources: error -2
GPU0 Memory size=16M
-----------------------------------------------------------------
lspci: Unable to load libkmod resources: error -2
00:02.0 VGA compatible controller: Intel Corporation Device 46d4 (prog-if 00 [VGA controller])
	DeviceName: Onboard - Video
	Subsystem: Intel Corporation Device 7270
	Flags: bus master, fast devsel, latency 0, IRQ 123
	Memory at 6000000000 (64-bit, non-prefetchable) [size=16M]
	Memory at 4000000000 (64-bit, prefetchable) [size=256M]
	I/O ports at 3000 [size=64]
	Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
	Capabilities: [40] Vendor Specific Information: Len=0c <?>
-----------------------------------------------------------------

Additional pip output:

root@e197ac9476ec:/llm/scripts# pip list | grep bigdl-core-cpp
bigdl-core-cpp                2.7.0b20250707
root@e197ac9476ec:/llm/scripts# pip list | grep ipex-llm
ipex-llm                      2.3.0b20250707

shivabohemian · Jul 28 '25

@shivabohemian We apologize for the inconvenience. The Intel® Processor N150 is a newly released CPU, and our current Ollama Docker image may not yet be fully optimized for this specific architecture.

Regrettably, we currently don't have access to N150-based hardware to:

  • Reproduce the reported issue
  • Conduct further troubleshooting

We will notify you immediately if we:

  • Develop support plans for the N150 series
  • Obtain test machines for validation

Arcs-ur · Jul 28 '25

@Arcs-ur All right. The "igpu not detected" message from env-check.sh may not be accurate, though: I tested it on an i5-1240P and got the same message, yet when running the model on that machine the output was normal.

shivabohemian · Jul 28 '25