
Is this performance normal for qwen3 8b with llama.cpp?

Open markussiebert opened this issue 6 months ago • 4 comments

I have a question regarding the performance of the Qwen3 model (specifically the 8B Q8_K_XL variant) when running on an A770 GPU.

Current observations:

  • Memory bandwidth (IMC): read 25,000 MiB/s, write 50 MiB/s
  • Compute utilization: approximately 30%
  • CPU core usage: 10 out of 12 cores at 100% utilization

The inference speed is really slow, about 8 tokens/second. Is this an expected result?

By contrast, the DeepSeek 0528 model uses nearly 100% of the compute and only one CPU core.

markussiebert avatar Jun 24 '25 07:06 markussiebert

Hi @markussiebert , that does not look like an expected result. On our A770 machine, with https://github.com/ipex-llm/ipex-llm/releases/download/v2.3.0-nightly/llama-cpp-ipex-llm-2.3.0b20250612-ubuntu-core.tgz, I can get about 30 tokens/second in the decode stage for DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf. My test command is:

export ONEAPI_DEVICE_SELECTOR=level_zero:0
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
./llama-cli -m  /mnt/disk1/models/DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf -t 8 -ngl 99 -c 1024

Would you mind double-checking it? By the way, I see only one CPU core being used.
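
If you want to compare with llama-server directly, a roughly equivalent invocation would be (just a sketch - the flags are the same ones used above, so adjust the model path and port to your setup):

export ONEAPI_DEVICE_SELECTOR=level_zero:0
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
# same model, thread count, full GPU offload and small context, but served over HTTP
./llama-server -m /mnt/disk1/models/DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf -t 8 -ngl 99 -c 1024 --host 0.0.0.0 --port 5801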

rnwang04 avatar Jun 25 '25 02:06 rnwang04

Hi @rnwang04 thanks for your answer.

For the DeepSeek Qwen3 model, I get about 21 t/s - maybe because your test machine has a better CPU?

-m /mnt/models/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF/DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf --port 5801 --host 0.0.0.0 --ctx-size 14336 --n-gpu-layers 99

Complete Logs:

:: initializing oneAPI environment ...
  llama-server-setvars: BASH_VERSION = 5.2.21(1)-release
  args: Using "$@" for setvars.sh arguments: -m /mnt/models/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF/DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf --port 5801 --host 0.0.0.0 --ctx-size 14336 --n-gpu-layers 99
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: pti -- latest
:: tbb -- latest
:: umf -- latest
:: vtune -- latest
:: oneAPI environment initialized ::

build: 1 (99a3cc3) with Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205) for x86_64-unknown-linux-gnu
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: HTTP server is listening, hostname: 0.0.0.0, port: 5801, http threads: 11
main: loading model
srv    load_model: loading model '/mnt/models/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF/DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf'
llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Arc(TM) A770 Graphics) - 15473 MiB free
llama_model_loader: loaded meta data with 37 key-value pairs and 399 tensors from /mnt/models/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF/DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Deepseek-R1-0528-Qwen3-8B
llama_model_loader: - kv   3:                           general.basename str              = Deepseek-R1-0528-Qwen3-8B
llama_model_loader: - kv   4:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   7:                          qwen3.block_count u32              = 36
llama_model_loader: - kv   8:                       qwen3.context_length u32              = 131072
llama_model_loader: - kv   9:                     qwen3.embedding_length u32              = 4096
llama_model_loader: - kv  10:                  qwen3.feed_forward_length u32              = 12288
llama_model_loader: - kv  11:                 qwen3.attention.head_count u32              = 32
llama_model_loader: - kv  12:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  13:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  14:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  16:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  17:                    qwen3.rope.scaling.type str              = yarn
llama_model_loader: - kv  18:                  qwen3.rope.scaling.factor f32              = 4.000000
llama_model_loader: - kv  19: qwen3.rope.scaling.original_context_length u32              = 32768
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 151654
llama_model_loader: - kv  28:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  29:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {%- if not add_generation_prompt is d...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:                          general.file_type u32              = 7
llama_model_loader: - kv  33:                      quantize.imatrix.file str              = DeepSeek-R1-0528-Qwen3-8B-GGUF/imatri...
llama_model_loader: - kv  34:                   quantize.imatrix.dataset str              = unsloth_calibration_DeepSeek-R1-0528-...
llama_model_loader: - kv  35:             quantize.imatrix.entries_count u32              = 252
llama_model_loader: - kv  36:              quantize.imatrix.chunks_count u32              = 713
llama_model_loader: - type  f32:  145 tensors
llama_model_loader: - type  f16:   63 tensors
llama_model_loader: - type q8_0:  191 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 10.08 GiB (10.57 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 28
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 36
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 12288
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = yarn
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 0.25
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 8B
print_info: model params     = 8.19 B
print_info: general.name     = Deepseek-R1-0528-Qwen3-8B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151645 '<|end▁of▁sentence|>'
print_info: EOT token        = 151645 '<|end▁of▁sentence|>'
print_info: PAD token        = 151654 '<|vision_pad|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151645 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
srv  log_server_r: request: GET /health 127.0.0.1 503
srv  log_server_r: request: GET /health 127.0.0.1 503
srv  log_server_r: request: GET /health 127.0.0.1 503
srv  log_server_r: request: GET /health 127.0.0.1 503
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  1187.00 MiB
load_tensors:        SYCL0 model buffer size =  9129.93 MiB
...............................srv  log_server_r: request: GET /health 127.0.0.1 503
................................................
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 14336
llama_init_from_model: n_ctx_per_seq = 14336
llama_init_from_model: n_batch       = 4096
llama_init_from_model: n_ubatch      = 4096
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 1000000.0
llama_init_from_model: freq_scale    = 0.25
llama_init_from_model: n_ctx_per_seq (14336) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
Running with Environment Variables:
 GGML_SYCL_DEBUG: 0
 GGML_SYCL_DISABLE_OPT: 1
Build with Macros:
 GGML_SYCL_FORCE_MMQ: no
 GGML_SYCL_F16: no
Found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|  12.55|    512|    1024|   32| 16225M|         1.6.33578+11|
SYCL Optimization Feature:
|ID|        Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]|      Y|
llama_kv_cache_init: kv_size = 14336, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 36, can_shift = 1
llama_kv_cache_init:      SYCL0 KV buffer size =  2016.00 MiB
llama_init_from_model: KV self size  = 2016.00 MiB, K (f16): 1008.00 MiB, V (f16): 1008.00 MiB
llama_init_from_model:  SYCL_Host  output buffer size =     0.58 MiB
llama_init_from_model:      SYCL0 compute buffer size =  2438.00 MiB
llama_init_from_model:  SYCL_Host compute buffer size =   288.05 MiB
llama_init_from_model: graph nodes  = 1194
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 14336
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 14336
main: model loaded
main: chat template, chat_template: {%- if not add_generation_prompt is defined %}
   {%- set add_generation_prompt = false %}
{%- endif %}
{%- set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='', is_first_sp=true, is_last_user=false) %}
{%- for message in messages %}
   {%- if message['role'] == 'system' %}
       {%- if ns.is_first_sp %}
           {%- set ns.system_prompt = ns.system_prompt + message['content'] %}
           {%- set ns.is_first_sp = false %}
       {%- else %}
           {%- set ns.system_prompt = ns.system_prompt + '\n\n' + message['content'] %}
       {%- endif %}
   {%- endif %}
{%- endfor %}

{#- Adapted from https://github.com/sgl-project/sglang/blob/main/examples/chat_template/tool_chat_template_deepseekr1.jinja #}
{%- if tools is defined and tools is not none %}
   {%- set tool_ns = namespace(text='You are a helpful assistant with tool calling capabilities. ' + 'When a tool call is needed, you MUST use the following format to issue the call:\n' + '<|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>FUNCTION_NAME\n' + '```json\n{"param1": "value1", "param2": "value2"}\n```<|tool▁call▁end|><|tool▁calls▁end|>\n\n' + 'Make sure the JSON is valid.' + '## Tools\n\n### Function\n\nYou have the following functions available:\n\n') %}
   {%- for tool in tools %}
       {%- set tool_ns.text = tool_ns.text + '\n```json\n' + (tool | tojson) + '\n```\n' %}
   {%- endfor %}
   {%- if ns.system_prompt|length != 0 %}
       {%- set ns.system_prompt = ns.system_prompt + '\n\n' + tool_ns.text %}
   {%- else %}
       {%- set ns.system_prompt = tool_ns.text %}
   {%- endif %}
{%- endif %}
{{- bos_token }}
{{- ns.system_prompt }}
{%- set last_index = (messages|length - 1) %}
{%- for message in messages %}
   {%- set content = message['content'] %}
   {%- if message['role'] == 'user' %}
       {%- set ns.is_tool = false -%}
       {%- set ns.is_first = false -%}
       {%- set ns.is_last_user = true -%}
       {%- if loop.index0 == last_index %}
           {{- '<|User|>' + content }}
       {%- else %}
           {{- '<|User|>' + content + '<|Assistant|>'}}
       {%- endif %}
   {%- endif %}
   {%- if message['role'] == 'assistant' %}
       {%- if '</think>' in content %}
           {%- set content = (content.split('</think>')|last) %}
       {%- endif %}
   {%- endif %}
   {%- if message['role'] == 'assistant' and message['tool_calls'] is defined and message['tool_calls'] is not none %}
       {%- set ns.is_last_user = false -%}
       {%- if ns.is_tool %}
           {{- '<|tool▁outputs▁end|>'}}
       {%- endif %}
       {%- set ns.is_first = false %}
       {%- set ns.is_tool = false -%}
       {%- set ns.is_output_first = true %}
       {%- for tool in message['tool_calls'] %}
           {%- set arguments = tool['function']['arguments'] %}
           {%- if arguments is not string %}
               {%- set arguments = arguments|tojson %}
           {%- endif %}
           {%- if not ns.is_first %}
               {%- if content is none %}
                   {{- '<|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + arguments + '\n' + '```' + '<|tool▁call▁end|>'}}
               }
               {%- else %}
                   {{- content + '<|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + arguments + '\n' + '```' + '<|tool▁call▁end|>'}}
               {%- endif %}
               {%- set ns.is_first = true -%}
           {%- else %}
               {{- '\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + arguments + '\n' + '```' + '<|tool▁call▁end|>'}}
           {%- endif %}
       {%- endfor %}
       {{- '<|tool▁calls▁end|><|end▁of▁sentence|>'}}
   {%- endif %}
   {%- if message['role'] == 'assistant' and (message['tool_calls'] is not defined or message['tool_calls'] is none) %}
       {%- set ns.is_last_user = false -%}
       {%- if ns.is_tool %}
           {{- '<|tool▁outputs▁end|>' + content + '<|end▁of▁sentence|>'}}
           {%- set ns.is_tool = false -%}
       {%- else %}
           {{- content + '<|end▁of▁sentence|>'}}
       {%- endif %}
   {%- endif %}
   {%- if message['role'] == 'tool' %}
       {%- set ns.is_last_user = false -%}
       {%- set ns.is_tool = true -%}
       {%- if ns.is_output_first %}
           {{- '<|tool▁outputs▁begin|><|tool▁output▁begin|>' + content + '<|tool▁output▁end|>'}}
           {%- set ns.is_output_first = false %}
       {%- else %}
           {{- '\n<|tool▁output▁begin|>' + content + '<|tool▁output▁end|>'}}
       {%- endif %}
   {%- endif %}
{%- endfor -%}
{%- if ns.is_tool %}
   {{- '<|tool▁outputs▁end|>'}}
{%- endif %}
{#- if add_generation_prompt and not ns.is_last_user and not ns.is_tool #}
{%- if add_generation_prompt and not ns.is_tool %}
   {{- '<|Assistant|>'}}
{%- endif %}, example_format: 'You are a helpful assistant

<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
main: server is listening on http://0.0.0.0:5801 - starting the main loop
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /health 127.0.0.1 200
srv  log_server_r: request: GET / 127.0.0.1 200
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 14336, n_keep = 0, n_prompt_tokens = 24
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 24, n_tokens = 24, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 24, n_tokens = 24
slot      release: id  0 | task 0 | stop processing: n_past = 1588, truncated = 0
slot print_timing: id  0 | task 0 | 
prompt eval time =     161.01 ms /    24 tokens (    6.71 ms per token,   149.06 tokens per second)
      eval time =   74469.52 ms /  1565 tokens (   47.58 ms per token,    21.02 tokens per second)
     total time =   74630.53 ms /  1589 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200

Not the results you get, but not that bad either - still, yours are about 50% higher ... maybe because of my larger context?
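
The KV cache at least scales with that setting; as a quick sanity check (assuming the f16 cache layout shown in the log), the 2016 MiB KV buffer matches n_layer * n_ctx * (n_embd_k_gqa + n_embd_v_gqa) * 2 bytes:

# 36 layers * 14336 ctx tokens * (1024 K + 1024 V) per token * 2 bytes (f16), converted to MiB
echo $((36 * 14336 * (1024 + 1024) * 2 / 1024 / 1024))   # prints 2016

So the larger -c mainly costs memory by itself, although the longer generated sequence also makes each new token a bit more expensive.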

But for "original qwen3" I get only 8 t/s

:: initializing oneAPI environment ...
  llama-server-setvars: BASH_VERSION = 5.2.21(1)-release
  args: Using "$@" for setvars.sh arguments: -m /mnt/models/unsloth/qwen3-8b-gguf/Qwen3-8B-UD-Q8_K_XL.gguf --port 5812 --host 0.0.0.0 --ctx-size 40960 --n-gpu-layers 99
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: pti -- latest
:: tbb -- latest
:: umf -- latest
:: vtune -- latest
:: oneAPI environment initialized ::

build: 1 (99a3cc3) with Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205) for x86_64-unknown-linux-gnu
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: HTTP server is listening, hostname: 0.0.0.0, port: 5812, http threads: 11
main: loading model
srv    load_model: loading model '/mnt/models/unsloth/qwen3-8b-gguf/Qwen3-8B-UD-Q8_K_XL.gguf'
llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Arc(TM) A770 Graphics) - 15473 MiB free
llama_model_loader: loaded meta data with 32 key-value pairs and 399 tensors from /mnt/models/unsloth/qwen3-8b-gguf/Qwen3-8B-UD-Q8_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3-8B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3-8B
llama_model_loader: - kv   4:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   7:                          qwen3.block_count u32              = 36
llama_model_loader: - kv   8:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv   9:                     qwen3.embedding_length u32              = 4096
llama_model_loader: - kv  10:                  qwen3.feed_forward_length u32              = 12288
llama_model_loader: - kv  11:                 qwen3.attention.head_count u32              = 32
llama_model_loader: - kv  12:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  13:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  14:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  16:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  22:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 151654
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - kv  27:                          general.file_type u32              = 7
llama_model_loader: - kv  28:                      quantize.imatrix.file str              = Qwen3-8B-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv  29:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3-8B.txt
llama_model_loader: - kv  30:             quantize.imatrix.entries_count i32              = 252
llama_model_loader: - kv  31:              quantize.imatrix.chunks_count i32              = 685
llama_model_loader: - type  f32:  145 tensors
llama_model_loader: - type q8_0:  191 tensors
llama_model_loader: - type bf16:   63 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 10.08 GiB (10.57 BPW) 
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 40960
print_info: n_embd           = 4096
print_info: n_layer          = 36
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 12288
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 40960
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 8B
print_info: model params     = 8.19 B
print_info: general.name     = Qwen3-8B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 11 ','
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151654 '<|vision_pad|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
srv  log_server_r: request: GET /health 127.0.0.1 503
srv  log_server_r: request: GET /health 127.0.0.1 503
srv  log_server_r: request: GET /health 127.0.0.1 503
srv  log_server_r: request: GET /health 127.0.0.1 503
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors:   CPU_Mapped model buffer size = 10316.93 MiB
load_tensors:        SYCL0 model buffer size =  6014.93 MiB
..........................................srv  log_server_r: request: GET /health 127.0.0.1 503
.....................................
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 40960
llama_init_from_model: n_ctx_per_seq = 40960
llama_init_from_model: n_batch       = 4096
llama_init_from_model: n_ubatch      = 4096
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 1000000.0
llama_init_from_model: freq_scale    = 1
Running with Environment Variables:
 GGML_SYCL_DEBUG: 0
 GGML_SYCL_DISABLE_OPT: 1
Build with Macros:
 GGML_SYCL_FORCE_MMQ: no
 GGML_SYCL_F16: no
Found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|  12.55|    512|    1024|   32| 16225M|         1.6.33578+11|
SYCL Optimization Feature:
|ID|        Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]|      Y|
llama_kv_cache_init: kv_size = 40960, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 36, can_shift = 1
llama_kv_cache_init:      SYCL0 KV buffer size =  5760.00 MiB
llama_init_from_model: KV self size  = 5760.00 MiB, K (f16): 2880.00 MiB, V (f16): 2880.00 MiB
llama_init_from_model:  SYCL_Host  output buffer size =     0.58 MiB
llama_init_from_model:      SYCL0 compute buffer size =  1184.02 MiB
llama_init_from_model:  SYCL_Host compute buffer size =  2438.00 MiB
llama_init_from_model: graph nodes  = 1194
llama_init_from_model: graph splits = 105
common_init_from_params: setting dry_penalty_last_n to ctx_size = 40960
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 40960
main: model loaded
main: chat template, chat_template: {%- if tools %}
   {{- '<|im_start|>system\n' }}
   {%- if messages[0].role == 'system' %}
       {{- messages[0].content + '\n\n' }}
   {%- endif %}
   {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
   {%- for tool in tools %}
       {{- "\n" }}
       {{- tool | tojson }}
   {%- endfor %}
   {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
   {%- if messages[0].role == 'system' %}
       {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
   {%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for forward_message in messages %}
   {%- set index = (messages|length - 1) - loop.index0 %}
   {%- set message = messages[index] %}
   {%- set current_content = message.content if message.content is defined and message.content is not none else '' %}
   {%- set tool_start = '<tool_response>' %}
   {%- set tool_start_length = tool_start|length %}
   {%- set start_of_message = current_content[:tool_start_length] %}
   {%- set tool_end = '</tool_response>' %}
   {%- set tool_end_length = tool_end|length %}
   {%- set start_pos = (current_content|length) - tool_end_length %}
   {%- if start_pos < 0 %}
       {%- set start_pos = 0 %}
   {%- endif %}
   {%- set end_of_message = current_content[start_pos:] %}
   {%- if ns.multi_step_tool and message.role == "user" and not(start_of_message == tool_start and end_of_message == tool_end) %}
       {%- set ns.multi_step_tool = false %}
       {%- set ns.last_query_index = index %}
   {%- endif %}
{%- endfor %}
{%- for message in messages %}
   {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
       {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
   {%- elif message.role == "assistant" %}
       {%- set m_content = message.content if message.content is defined and message.content is not none else '' %}
       {%- set content = m_content %}
       {%- set reasoning_content = '' %}
       {%- if message.reasoning_content is defined and message.reasoning_content is not none %}
           {%- set reasoning_content = message.reasoning_content %}
       {%- else %}
           {%- if '</think>' in m_content %}
               {%- set content = (m_content.split('</think>')|last).lstrip('\n') %}
               {%- set reasoning_content = (m_content.split('</think>')|first).rstrip('\n') %}
               {%- set reasoning_content = (reasoning_content.split('<think>')|last).lstrip('\n') %}
           {%- endif %}
       {%- endif %}
       {%- if loop.index0 > ns.last_query_index %}
           {%- if loop.last or (not loop.last and (not reasoning_content.strip() == '')) %}
               {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
           {%- else %}
               {{- '<|im_start|>' + message.role + '\n' + content }}
           {%- endif %}
       {%- else %}
           {{- '<|im_start|>' + message.role + '\n' + content }}
       {%- endif %}
       {%- if message.tool_calls %}
           {%- for tool_call in message.tool_calls %}
               {%- if (loop.first and content) or (not loop.first) %}
                   {{- '\n' }}
               {%- endif %}
               {%- if tool_call.function %}
                   {%- set tool_call = tool_call.function %}
               {%- endif %}
               {{- '<tool_call>\n{"name": "' }}
               {{- tool_call.name }}
               {{- '", "arguments": ' }}
               {%- if tool_call.arguments is string %}
                   {{- tool_call.arguments }}
               {%- else %}
                   {{- tool_call.arguments | tojson }}
               {%- endif %}
               {{- '}\n</tool_call>' }}
           {%- endfor %}
       {%- endif %}
       {{- '<|im_end|>\n' }}
   {%- elif message.role == "tool" %}
       {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
           {{- '<|im_start|>user' }}
       {%- endif %}
       {{- '\n<tool_response>\n' }}
       {{- message.content }}
       {{- '\n</tool_response>' }}
       {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
           {{- '<|im_end|>\n' }}
       {%- endif %}
   {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
   {{- '<|im_start|>assistant\n' }}
   {%- if enable_thinking is defined and enable_thinking is false %}
       {{- '<think>\n\n</think>\n\n' }}
   {%- endif %}
{%- endif %}, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://0.0.0.0:5812 - starting the main loop
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /health 127.0.0.1 200
srv  log_server_r: request: GET / 127.0.0.1 200
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 40960, n_keep = 0, n_prompt_tokens = 32
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 32, n_tokens = 32, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 32, n_tokens = 32

markussiebert avatar Jun 26 '25 09:06 markussiebert

Hi @markussiebert ,

Performance on my machine

My test machine is i9-13900K + A770, and I tested with this portable zip https://github.com/ipex-llm/ipex-llm/releases/download/v2.3.0-nightly/llama-cpp-ipex-llm-2.3.0b20250612-ubuntu-core.tgz . On this machine, I got about 30.9 tokens/s for DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf, 14.64 tokens/s for Qwen3-8B-UD-Q8_K_XL.gguf, and 56.84 tokens/s for Qwen3-8B-Q4_K_M.gguf.

Performance gap between the two Q8_K_XL.gguf files

Based on the output logs, I found that Qwen3-8B-UD-Q8_K_XL.gguf contains bf16 tensors, while DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf contains f16 tensors. I guess this performance gap is caused by the fact that we do not yet have good support for bf16 tensors. Your logs also look consistent with that: the Qwen3 run reports a much larger CPU_Mapped model buffer (10316.93 MiB vs 1187.00 MiB) and far more graph splits (105 vs 2), which suggests part of the model is running on the CPU path. I may take a further look at this later; if there is any update, I will post it here to let you know.

# DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf
llama_model_loader: - type  f32:  145 tensors
llama_model_loader: - type  f16:   63 tensors
llama_model_loader: - type q8_0:  191 tensors
# Qwen3-8B-UD-Q8_K_XL.gguf
llama_model_loader: - type  f32:  145 tensors
llama_model_loader: - type q8_0:  191 tensors
llama_model_loader: - type bf16:   63 tensors
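
If you need to stay close to 8-bit before this is properly fixed, one possible workaround (just a sketch, untested on my side, and it requires the llama-quantize tool, which may or may not be included in the portable zip) is to requantize the file so the bf16 tensors become Q8_0 as well:

# --allow-requantize lets already-quantized tensors be converted too; expect a small quality cost
./llama-quantize --allow-requantize Qwen3-8B-UD-Q8_K_XL.gguf Qwen3-8B-Q8_0.gguf Q8_0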

But if you do not have a strict precision requirement, I would recommend trying the Q4_K_M version; it provides better performance.
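
For example (assuming the usual unsloth/Qwen3-8B-GGUF repo layout on Hugging Face; the target directory below is just your existing model folder):

# hypothetical fetch of the Q4_K_M file next to your other GGUFs
huggingface-cli download unsloth/Qwen3-8B-GGUF Qwen3-8B-Q4_K_M.gguf --local-dir /mnt/models/unsloth/qwen3-8b-gguf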

Performance gap between our tests

Hmm, actually I am not very sure about this; here are some possible reasons:

  • CPU difference
  • Different input / output lengths: I tested with a short input and output, but I see your output is 1565 tokens (a fixed-length llama-bench run, sketched below, would control for this)
  • Which version of llama.cpp are you using?
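
To take the input / output length difference out of the picture, a fixed-length benchmark on both machines would make the numbers directly comparable. A sketch, assuming llama-bench is shipped in the same portable zip:

export ONEAPI_DEVICE_SELECTOR=level_zero:0
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
# 512-token prompt, 128 generated tokens, all layers offloaded to the GPU
./llama-bench -m /mnt/models/unsloth/qwen3-8b-gguf/Qwen3-8B-UD-Q8_K_XL.gguf -ngl 99 -p 512 -n 128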

rnwang04 avatar Jun 26 '25 13:06 rnwang04

I run

llama-cpp-ipex-llm-2.3.0b20250612-ubuntu-core

so the same version you are using. My CPU is way older than yours - it might make a difference.

markussiebert avatar Jun 26 '25 13:06 markussiebert