
The NPU version of llama.cpp did not return an appropriate response on Intel Core Ultra 7 268V

Open kotauchisunsun opened this issue 8 months ago • 6 comments

**Describe the bug**

I tried running the NPU version of llama.cpp on Windows 11, but I did not receive an appropriate response.

Model: DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf
Executable: llama-cpp-ipex-llm-2.2.0-win-npu.zip
CPU: Intel Core Ultra 7 268V
NPU: Intel AI Boost
NPU Driver: 32.0.100.3967

The command itself was referenced in the following URL:
https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quickstart/npu_quickstart.md

Could the issue be with the method of execution? Or is it that the model is not supported?

**Logs**

PS C:\Users\kotau\Downloads\llama-cpp-ipex-llm-2.2.0-win-npu> .\llama-cli-npu.exe -m ..\DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf -n 32 --prompt "What is AI?"
build: 1 (3ac676a) with MSVC 19.39.33519.0 for x64
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from ..\DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 7B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 7B
llama_model_loader: - kv   5:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = deepseek-r1-qwen
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - kv  25:                          general.file_type u32              = 18
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q6_K:  198 tensors
llm_load_vocab: missing or unrecognized pre-tokenizer type, using: 'default'
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q6_K
llm_load_print_meta: model params     = 7.62 B
llm_load_print_meta: model size       = 5.82 GiB (6.56 BPW)
llm_load_print_meta: general.name     = DeepSeek R1 Distill Qwen 7B
llm_load_print_meta: BOS token        = 151646 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOG token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors:        CPU buffer size =  5958.79 MiB
........................................................................................
Directory created: "C:\\Users\\kotau\\Downloads\\llama-cpp-ipex-llm-2.2.0-win-npu\\NPU_models\\qwen2-28-3584-152064-Q4_0"
Directory created: "C:\\Users\\kotau\\Downloads\\llama-cpp-ipex-llm-2.2.0-win-npu\\NPU_models\\qwen2-28-3584-152064-Q4_0\\model_weights"
Converting GGUF model to Q4_0 NPU model...
Model weights saved to C:\Users\kotau\Downloads\llama-cpp-ipex-llm-2.2.0-win-npu\NPU_models\qwen2-28-3584-152064-Q4_0\model_weights
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from ..\DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 7B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 7B
llama_model_loader: - kv   5:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = deepseek-r1-qwen
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - kv  25:                          general.file_type u32              = 18
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q6_K:  198 tensors
llm_load_vocab: missing or unrecognized pre-tokenizer type, using: 'default'
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 1
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = all F32
llm_load_print_meta: model params     = 7.62 B
llm_load_print_meta: model size       = 5.82 GiB (6.56 BPW)
llm_load_print_meta: general.name     = DeepSeek R1 Distill Qwen 7B
llm_load_print_meta: BOS token        = 151646 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOG token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab only - skipping tensors
Model saved to C:\Users\kotau\Downloads\llama-cpp-ipex-llm-2.2.0-win-npu\NPU_models\qwen2-28-3584-152064-Q4_0//decoder_layer_0.blob
Model saved to C:\Users\kotau\Downloads\llama-cpp-ipex-llm-2.2.0-win-npu\NPU_models\qwen2-28-3584-152064-Q4_0//decoder_layer_1.blob
llama_new_context_with_model: n_ctx      = 1024
llama_new_context_with_model: n_batch    = 1024
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 0.0
llama_new_context_with_model: freq_scale = 1
・ソusing              犹もク巵ク」犹・ <think>リオル・ッ  Di盻・  2犹もク巵ク」犹・ 1  2

llm_perf_print:        load time =   53917.00 ms
llm_perf_print: prompt eval time =    3467.00 ms /     7 tokens (  495.29 ms per token,     2.02 tokens per second)
llm_perf_print:        eval time =    2704.00 ms /    31 runs   (   87.23 ms per token,    11.46 tokens per second)
llm_perf_print:       total time =   60146.00 ms /    38 tokens

kotauchisunsun avatar Apr 14 '25 15:04 kotauchisunsun

Hi, please use the 32.0.100.3104 NPU driver, as described in our documentation (https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quickstart/llama_cpp_npu_portable_zip_quickstart.md#prerequisites).
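
To double-check which NPU driver is currently active, you could also query it from PowerShell (a minimal sketch; the "*AI Boost*" friendly-name filter is an assumption about how the NPU enumerates in Device Manager):

$npu = Get-PnpDevice -FriendlyName "*AI Boost*"                                           # locate the NPU device entry
Get-PnpDeviceProperty -InstanceId $npu.InstanceId -KeyName DEVPKEY_Device_DriverVersion   # print the installed driver version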

plusbang avatar Apr 15 '25 02:04 plusbang

@plusbang Thank you for your reply! I uninstalled the 32.0.100.3967 driver that was installed on my PC. Then, I tried to install the 32.0.100.3104 NPU driver. However, the driver version shown in Device Manager is 32.0.100.3717. It seems that the PC originally came with a later version than 32.0.100.3104 — which is the one confirmed to work — so I’m unable to install the older version.
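
Perhaps the newer package has to be removed from the driver store before the older one will install; something like the following in an elevated PowerShell might do it, though I have not verified this (oemXX.inf is a placeholder for whatever published name pnputil reports for the Intel NPU driver):

pnputil /enum-drivers                               # find the oemXX.inf entry for the Intel NPU driver
pnputil /delete-driver oemXX.inf /uninstall /force  # delete it, then install 32.0.100.3104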

[Screenshot: Device Manager showing NPU driver version 32.0.100.3717]

Just in case, I tried running it with the 32.0.100.3717 driver, but I didn’t get an appropriate response.

Logs
PS C:\Users\kotau\Downloads\llama-cpp-ipex-llm-2.2.0-win-npu> ./llama-cli-npu.exe -m ..\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "Hello"
build: 1 (3ac676a) with MSVC 19.39.33519.0 for x64
llama_model_loader: loaded meta data with 30 key-value pairs and 339 tensors from ..\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 7B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 7B
llama_model_loader: - kv   5:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = deepseek-r1-qwen
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - kv  25:                          general.file_type u32              = 15
llama_model_loader: - kv  26:                      quantize.imatrix.file str              = /models_out/DeepSeek-R1-Distill-Qwen-...
llama_model_loader: - kv  27:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  28:             quantize.imatrix.entries_count i32              = 196
llama_model_loader: - kv  29:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: missing or unrecognized pre-tokenizer type, using: 'default'
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 7.62 B
llm_load_print_meta: model size       = 4.36 GiB (4.91 BPW)
llm_load_print_meta: general.name     = DeepSeek R1 Distill Qwen 7B
llm_load_print_meta: BOS token        = 151646 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOG token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors:        CPU buffer size =  4460.45 MiB
....................................................................................
Failed to create directory or already exists: "C:\\Users\\kotau\\Downloads\\llama-cpp-ipex-llm-2.2.0-win-npu\\NPU_models\\qwen2-28-3584-152064-Q4_0\\model_weights"
llama_model_loader: loaded meta data with 30 key-value pairs and 339 tensors from ..\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 7B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 7B
llama_model_loader: - kv   5:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = deepseek-r1-qwen
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - kv  25:                          general.file_type u32              = 15
llama_model_loader: - kv  26:                      quantize.imatrix.file str              = /models_out/DeepSeek-R1-Distill-Qwen-...
llama_model_loader: - kv  27:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  28:             quantize.imatrix.entries_count i32              = 196
llama_model_loader: - kv  29:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: missing or unrecognized pre-tokenizer type, using: 'default'
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 1
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = all F32
llm_load_print_meta: model params     = 7.62 B
llm_load_print_meta: model size       = 4.36 GiB (4.91 BPW)
llm_load_print_meta: general.name     = DeepSeek R1 Distill Qwen 7B
llm_load_print_meta: BOS token        = 151646 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOG token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab only - skipping tensors
llama_new_context_with_model: n_ctx      = 1024
llama_new_context_with_model: n_batch    = 1024
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 0.0
llama_new_context_with_model: freq_scale = 1
   1  犹もク巵ク」犹・2犹もク巵ク」犹・ 2  1  1犹もク巵ク」犹・縺励▲縺九j縺ィ 2  2 犧・2犹もク巵ク」犹・

llm_perf_print:        load time =   10377.00 ms
llm_perf_print: prompt eval time =    3651.00 ms /     4 tokens (  912.75 ms per token,     1.10 tokens per second)
llm_perf_print:        eval time =    2941.00 ms /    31 runs   (   94.87 ms per token,    10.54 tokens per second)
llm_perf_print:       total time =   17111.00 ms /    35 tokens
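
Incidentally, the "Failed to create directory or already exists" line above suggests this run reused the blobs converted during my earlier Q6_K attempt, since both GGUF files map to the same NPU_models\qwen2-28-3584-152064-Q4_0 folder. Deleting that cache before re-running should force a fresh conversion (a minimal sketch):

Remove-Item -Recurse -Force .\NPU_models   # the next run re-converts the GGUF from scratch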

I have two questions:

  1. Do you plan to support newer driver versions like 32.0.100.3717 or 32.0.100.3967?
  2. This is more out of intellectual curiosity, but why do such issues occur, where the output becomes garbled depending on the driver version? Is it because the NPU driver stack is still unstable?

kotauchisunsun avatar Apr 16 '25 06:04 kotauchisunsun

Just curious why the driver version has to be exactly the same. Is there some compatibility problem with the DLL files provided by the driver?

buikhoa40 avatar Apr 16 '25 13:04 buikhoa40

Same problem here when I use DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf model.

Interestingly, the garbled output issue doesn't seem to happen on DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf

CPU: Intel Core Ultra 7 258V
NPU Driver: 32.0.100.3764 (from Device Manager)

I'm well aware that the driver version I'm using may not be the recommended one, but it is very interesting.

I wonder if it has something to do with how large the model's parameter count is.

> .\llama-cli-npu.exe -m .\models\DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf -n 32 -p "What is AI?"
build: 1 (3ac676a) with MSVC 19.39.33519.0 for x64
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from .\models\DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 1.5B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 1.5B
llama_model_loader: - kv   5:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 1536
llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 8960
llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 12
llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = deepseek-r1-qwen
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - kv  25:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: missing or unrecognized pre-tokenizer type, using: 'default'
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 1536
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 6
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8960
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 1.78 B
llm_load_print_meta: model size       = 1.04 GiB (5.00 BPW)
llm_load_print_meta: general.name     = DeepSeek R1 Distill Qwen 1.5B
llm_load_print_meta: BOS token        = 151646 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOG token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors:        CPU buffer size =  1059.89 MiB
.........................................................................
Directory created: "C:\\Users\\USER\\Downloads\\llama-cpp-ipex-llm-2.2.0-win-npu\\NPU_models\\qwen2-28-1536-151936-Q4_0"
Directory created: "C:\\Users\\USER\\Downloads\\llama-cpp-ipex-llm-2.2.0-win-npu\\NPU_models\\qwen2-28-1536-151936-Q4_0\\model_weights"
Converting GGUF model to Q4_0 NPU model...
Model weights saved to C:\Users\USER\Downloads\llama-cpp-ipex-llm-2.2.0-win-npu\NPU_models\qwen2-28-1536-151936-Q4_0\model_weights
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from .\models\DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 1.5B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 1.5B
llama_model_loader: - kv   5:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 1536
llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 8960
llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 12
llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = deepseek-r1-qwen
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - kv  25:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: missing or unrecognized pre-tokenizer type, using: 'default'
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 1
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = all F32
llm_load_print_meta: model params     = 1.78 B
llm_load_print_meta: model size       = 1.04 GiB (5.00 BPW)
llm_load_print_meta: general.name     = DeepSeek R1 Distill Qwen 1.5B
llm_load_print_meta: BOS token        = 151646 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOG token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab only - skipping tensors
Model saved to C:\Users\USER\Downloads\llama-cpp-ipex-llm-2.2.0-win-npu\NPU_models\qwen2-28-1536-151936-Q4_0//decoder_layer_0.blob
llama_new_context_with_model: n_ctx      = 1024
llama_new_context_with_model: n_batch    = 1024
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 0.0
llama_new_context_with_model: freq_scale = 1
<think>
Okay, so I'm trying to understand what AI is. From the initial answer, it seems like AI stands for artificial intelligence or machine learning, but

llm_perf_print:        load time =    1361.00 ms
llm_perf_print: prompt eval time =     707.00 ms /     7 tokens (  101.00 ms per token,     9.90 tokens per second)
llm_perf_print:        eval time =     559.00 ms /    31 runs   (   18.03 ms per token,    55.46 tokens per second)
llm_perf_print:       total time =    2671.00 ms /    38 tokens

KrisNathan avatar Jun 18 '25 08:06 KrisNathan

> Same problem here when I use DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf model.
>
> Interestingly, the garbled output issue doesn't seem to happen on DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf
>
> CPU: Intel Core Ultra 7 258V
> NPU Driver: 32.0.100.3764 (from Device Manager)
>
> I'm well aware that the driver version I'm using may not be the recommended one, but it is very interesting.
>
> I wonder if it has something to do with how large the model's parameter count is.

Since the NPU implementations for 1.5b and 7b are different, the 1.5b implementation may not trigger certain NPU driver bugs that could be present in the 7b implementation.

cyita avatar Jun 19 '25 02:06 cyita

> Same problem here when I use DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf model. Interestingly, the garbled output issue doesn't seem to happen on DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf CPU: Intel Core Ultra 7 258V NPU Driver: 32.0.100.3764 (from Device Manager) I'm well aware that the driver version I'm using may not be the recommended one, but it is very interesting. I wonder if it has something to do with how large the model's parameter count is.
>
> Since the NPU implementations for 1.5b and 7b are different, the 1.5b implementation may not trigger certain NPU driver bugs that could be present in the 7b implementation.

Ah, I see. Hopefully this issue can be isolated even further until it's resolved.

KrisNathan avatar Jun 19 '25 02:06 KrisNathan