Is this performance normal for Qwen3 8B with llama.cpp?
I have a question regarding the performance of the Qwen3 model (specifically the 8B Q8_K_XL variant) when running on an A770 GPU.
Current Observations:
Memory Bandwidth (IMC):
IMC Read: 25,000 MiB/s
IMC Write: 50 MiB/s
Compute Utilization: Approximately 30%
CPU Core Usage: 10 out of 12 cores are at 100% utilization.
The inference speed is really slow, about 8 tokens/second. Is this an expected result?
By contrast, the DeepSeek 0528 model uses nearly 100% GPU compute and only one CPU core.
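One check I can do is confirm that all layers really end up on the GPU, since the CPU usage looks high (just a sketch; it assumes the llama.cpp output is saved to server.log):

# look for the offload summary and the per-device model buffer sizes in the llama.cpp log
grep -E "offloaded|model buffer size" server.log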
Hi @markussiebert , that does not look like an expected result. On our A770 machine, with https://github.com/ipex-llm/ipex-llm/releases/download/v2.3.0-nightly/llama-cpp-ipex-llm-2.3.0b20250612-ubuntu-core.tgz, for DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf I can get about 30 tokens/second for the decode stage. My test command is:
export ONEAPI_DEVICE_SELECTOR=level_zero:0
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
./llama-cli -m /mnt/disk1/models/DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf -t 8 -ngl 99 -c 1024
Would you mind double-checking it? Btw, I see there is only one CPU core used.
Hi @rnwang04, thanks for your answer.
For the DeepSeek Qwen3 model, I get about 21 t/s - maybe because your test machine has a better CPU?
-m /mnt/models/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF/DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf --port 5801 --host 0.0.0.0 --ctx-size 14336 --n-gpu-layers 99
Complete Logs:
:: initializing oneAPI environment ...
llama-server-setvars: BASH_VERSION = 5.2.21(1)-release
args: Using "$@" for setvars.sh arguments: -m /mnt/models/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF/DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf --port 5801 --host 0.0.0.0 --ctx-size 14336 --n-gpu-layers 99
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: pti -- latest
:: tbb -- latest
:: umf -- latest
:: vtune -- latest
:: oneAPI environment initialized ::
build: 1 (99a3cc3) with Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205) for x86_64-unknown-linux-gnu
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12
system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: HTTP server is listening, hostname: 0.0.0.0, port: 5801, http threads: 11
main: loading model
srv load_model: loading model '/mnt/models/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF/DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf'
llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Arc(TM) A770 Graphics) - 15473 MiB free
llama_model_loader: loaded meta data with 37 key-value pairs and 399 tensors from /mnt/models/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF/DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Deepseek-R1-0528-Qwen3-8B
llama_model_loader: - kv 3: general.basename str = Deepseek-R1-0528-Qwen3-8B
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
llama_model_loader: - kv 5: general.size_label str = 8B
llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 7: qwen3.block_count u32 = 36
llama_model_loader: - kv 8: qwen3.context_length u32 = 131072
llama_model_loader: - kv 9: qwen3.embedding_length u32 = 4096
llama_model_loader: - kv 10: qwen3.feed_forward_length u32 = 12288
llama_model_loader: - kv 11: qwen3.attention.head_count u32 = 32
llama_model_loader: - kv 12: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: qwen3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 14: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 15: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 16: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 17: qwen3.rope.scaling.type str = yarn
llama_model_loader: - kv 18: qwen3.rope.scaling.factor f32 = 4.000000
llama_model_loader: - kv 19: qwen3.rope.scaling.original_context_length u32 = 32768
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 29: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 30: tokenizer.chat_template str = {%- if not add_generation_prompt is d...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 7
llama_model_loader: - kv 33: quantize.imatrix.file str = DeepSeek-R1-0528-Qwen3-8B-GGUF/imatri...
llama_model_loader: - kv 34: quantize.imatrix.dataset str = unsloth_calibration_DeepSeek-R1-0528-...
llama_model_loader: - kv 35: quantize.imatrix.entries_count u32 = 252
llama_model_loader: - kv 36: quantize.imatrix.chunks_count u32 = 713
llama_model_loader: - type f32: 145 tensors
llama_model_loader: - type f16: 63 tensors
llama_model_loader: - type q8_0: 191 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 10.08 GiB (10.57 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 28
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 4096
print_info: n_layer = 36
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 12288
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = yarn
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.25
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 8B
print_info: model params = 8.19 B
print_info: general.name = Deepseek-R1-0528-Qwen3-8B
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|begin▁of▁sentence|>'
print_info: EOS token = 151645 '<|end▁of▁sentence|>'
print_info: EOT token = 151645 '<|end▁of▁sentence|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151645 '<|end▁of▁sentence|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: CPU_Mapped model buffer size = 1187.00 MiB
load_tensors: SYCL0 model buffer size = 9129.93 MiB
...............................srv log_server_r: request: GET /health 127.0.0.1 503
................................................
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 14336
llama_init_from_model: n_ctx_per_seq = 14336
llama_init_from_model: n_batch = 4096
llama_init_from_model: n_ubatch = 4096
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 0.25
llama_init_from_model: n_ctx_per_seq (14336) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
Running with Environment Variables:
GGML_SYCL_DEBUG: 0
GGML_SYCL_DISABLE_OPT: 1
Build with Macros:
GGML_SYCL_FORCE_MMQ: no
GGML_SYCL_F16: no
Found 1 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc A770 Graphics| 12.55| 512| 1024| 32| 16225M| 1.6.33578+11|
SYCL Optimization Feature:
|ID| Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]| Y|
llama_kv_cache_init: kv_size = 14336, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 36, can_shift = 1
llama_kv_cache_init: SYCL0 KV buffer size = 2016.00 MiB
llama_init_from_model: KV self size = 2016.00 MiB, K (f16): 1008.00 MiB, V (f16): 1008.00 MiB
llama_init_from_model: SYCL_Host output buffer size = 0.58 MiB
llama_init_from_model: SYCL0 compute buffer size = 2438.00 MiB
llama_init_from_model: SYCL_Host compute buffer size = 288.05 MiB
llama_init_from_model: graph nodes = 1194
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 14336
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 14336
main: model loaded
main: chat template, chat_template: {%- if not add_generation_prompt is defined %}
{%- set add_generation_prompt = false %}
{%- endif %}
{%- set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='', is_first_sp=true, is_last_user=false) %}
{%- for message in messages %}
{%- if message['role'] == 'system' %}
{%- if ns.is_first_sp %}
{%- set ns.system_prompt = ns.system_prompt + message['content'] %}
{%- set ns.is_first_sp = false %}
{%- else %}
{%- set ns.system_prompt = ns.system_prompt + '\n\n' + message['content'] %}
{%- endif %}
{%- endif %}
{%- endfor %}
{#- Adapted from https://github.com/sgl-project/sglang/blob/main/examples/chat_template/tool_chat_template_deepseekr1.jinja #}
{%- if tools is defined and tools is not none %}
{%- set tool_ns = namespace(text='You are a helpful assistant with tool calling capabilities. ' + 'When a tool call is needed, you MUST use the following format to issue the call:\n' + '<|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>FUNCTION_NAME\n' + '```json\n{"param1": "value1", "param2": "value2"}\n```<|tool▁call▁end|><|tool▁calls▁end|>\n\n' + 'Make sure the JSON is valid.' + '## Tools\n\n### Function\n\nYou have the following functions available:\n\n') %}
{%- for tool in tools %}
{%- set tool_ns.text = tool_ns.text + '\n```json\n' + (tool | tojson) + '\n```\n' %}
{%- endfor %}
{%- if ns.system_prompt|length != 0 %}
{%- set ns.system_prompt = ns.system_prompt + '\n\n' + tool_ns.text %}
{%- else %}
{%- set ns.system_prompt = tool_ns.text %}
{%- endif %}
{%- endif %}
{{- bos_token }}
{{- ns.system_prompt }}
{%- set last_index = (messages|length - 1) %}
{%- for message in messages %}
{%- set content = message['content'] %}
{%- if message['role'] == 'user' %}
{%- set ns.is_tool = false -%}
{%- set ns.is_first = false -%}
{%- set ns.is_last_user = true -%}
{%- if loop.index0 == last_index %}
{{- '<|User|>' + content }}
{%- else %}
{{- '<|User|>' + content + '<|Assistant|>'}}
{%- endif %}
{%- endif %}
{%- if message['role'] == 'assistant' %}
{%- if '</think>' in content %}
{%- set content = (content.split('</think>')|last) %}
{%- endif %}
{%- endif %}
{%- if message['role'] == 'assistant' and message['tool_calls'] is defined and message['tool_calls'] is not none %}
{%- set ns.is_last_user = false -%}
{%- if ns.is_tool %}
{{- '<|tool▁outputs▁end|>'}}
{%- endif %}
{%- set ns.is_first = false %}
{%- set ns.is_tool = false -%}
{%- set ns.is_output_first = true %}
{%- for tool in message['tool_calls'] %}
{%- set arguments = tool['function']['arguments'] %}
{%- if arguments is not string %}
{%- set arguments = arguments|tojson %}
{%- endif %}
{%- if not ns.is_first %}
{%- if content is none %}
{{- '<|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + arguments + '\n' + '```' + '<|tool▁call▁end|>'}}
}
{%- else %}
{{- content + '<|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + arguments + '\n' + '```' + '<|tool▁call▁end|>'}}
{%- endif %}
{%- set ns.is_first = true -%}
{%- else %}
{{- '\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + arguments + '\n' + '```' + '<|tool▁call▁end|>'}}
{%- endif %}
{%- endfor %}
{{- '<|tool▁calls▁end|><|end▁of▁sentence|>'}}
{%- endif %}
{%- if message['role'] == 'assistant' and (message['tool_calls'] is not defined or message['tool_calls'] is none) %}
{%- set ns.is_last_user = false -%}
{%- if ns.is_tool %}
{{- '<|tool▁outputs▁end|>' + content + '<|end▁of▁sentence|>'}}
{%- set ns.is_tool = false -%}
{%- else %}
{{- content + '<|end▁of▁sentence|>'}}
{%- endif %}
{%- endif %}
{%- if message['role'] == 'tool' %}
{%- set ns.is_last_user = false -%}
{%- set ns.is_tool = true -%}
{%- if ns.is_output_first %}
{{- '<|tool▁outputs▁begin|><|tool▁output▁begin|>' + content + '<|tool▁output▁end|>'}}
{%- set ns.is_output_first = false %}
{%- else %}
{{- '\n<|tool▁output▁begin|>' + content + '<|tool▁output▁end|>'}}
{%- endif %}
{%- endif %}
{%- endfor -%}
{%- if ns.is_tool %}
{{- '<|tool▁outputs▁end|>'}}
{%- endif %}
{#- if add_generation_prompt and not ns.is_last_user and not ns.is_tool #}
{%- if add_generation_prompt and not ns.is_tool %}
{{- '<|Assistant|>'}}
{%- endif %}, example_format: 'You are a helpful assistant
<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
main: server is listening on http://0.0.0.0:5801 - starting the main loop
srv update_slots: all slots are idle
srv log_server_r: request: GET /health 127.0.0.1 200
srv log_server_r: request: GET / 127.0.0.1 200
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 14336, n_keep = 0, n_prompt_tokens = 24
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 24, n_tokens = 24, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 24, n_tokens = 24
slot release: id 0 | task 0 | stop processing: n_past = 1588, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 161.01 ms / 24 tokens ( 6.71 ms per token, 149.06 tokens per second)
eval time = 74469.52 ms / 1565 tokens ( 47.58 ms per token, 21.02 tokens per second)
total time = 74630.53 ms / 1589 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
Not the numbers you get, but not that bad either - yours are still about 50% higher ... maybe because of my larger context?
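To rule that out I could relaunch the server with the same small context you used and re-measure (just a sketch, reusing my own model path and port):

./llama-server -m /mnt/models/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF/DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf --port 5801 --host 0.0.0.0 --ctx-size 1024 --n-gpu-layers 99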
But for "original qwen3" I get only 8 t/s
:: initializing oneAPI environment ...
llama-server-setvars: BASH_VERSION = 5.2.21(1)-release
args: Using "$@" for setvars.sh arguments: -m /mnt/models/unsloth/qwen3-8b-gguf/Qwen3-8B-UD-Q8_K_XL.gguf --port 5812 --host 0.0.0.0 --ctx-size 40960 --n-gpu-layers 99
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: pti -- latest
:: tbb -- latest
:: umf -- latest
:: vtune -- latest
:: oneAPI environment initialized ::
build: 1 (99a3cc3) with Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205) for x86_64-unknown-linux-gnu
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12
system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: HTTP server is listening, hostname: 0.0.0.0, port: 5812, http threads: 11
main: loading model
srv load_model: loading model '/mnt/models/unsloth/qwen3-8b-gguf/Qwen3-8B-UD-Q8_K_XL.gguf'
llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Arc(TM) A770 Graphics) - 15473 MiB free
llama_model_loader: loaded meta data with 32 key-value pairs and 399 tensors from /mnt/models/unsloth/qwen3-8b-gguf/Qwen3-8B-UD-Q8_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-8B
llama_model_loader: - kv 3: general.basename str = Qwen3-8B
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
llama_model_loader: - kv 5: general.size_label str = 8B
llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 7: qwen3.block_count u32 = 36
llama_model_loader: - kv 8: qwen3.context_length u32 = 40960
llama_model_loader: - kv 9: qwen3.embedding_length u32 = 4096
llama_model_loader: - kv 10: qwen3.feed_forward_length u32 = 12288
llama_model_loader: - kv 11: qwen3.attention.head_count u32 = 32
llama_model_loader: - kv 12: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: qwen3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 14: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 15: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 16: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 25: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - kv 27: general.file_type u32 = 7
llama_model_loader: - kv 28: quantize.imatrix.file str = Qwen3-8B-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv 29: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-8B.txt
llama_model_loader: - kv 30: quantize.imatrix.entries_count i32 = 252
llama_model_loader: - kv 31: quantize.imatrix.chunks_count i32 = 685
llama_model_loader: - type f32: 145 tensors
llama_model_loader: - type q8_0: 191 tensors
llama_model_loader: - type bf16: 63 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 10.08 GiB (10.57 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3
print_info: vocab_only = 0
print_info: n_ctx_train = 40960
print_info: n_embd = 4096
print_info: n_layer = 36
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 12288
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 40960
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 8B
print_info: model params = 8.19 B
print_info: general.name = Qwen3-8B
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: CPU_Mapped model buffer size = 10316.93 MiB
load_tensors: SYCL0 model buffer size = 6014.93 MiB
..........................................srv log_server_r: request: GET /health 127.0.0.1 503
.....................................
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 40960
llama_init_from_model: n_ctx_per_seq = 40960
llama_init_from_model: n_batch = 4096
llama_init_from_model: n_ubatch = 4096
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 1
Running with Environment Variables:
GGML_SYCL_DEBUG: 0
GGML_SYCL_DISABLE_OPT: 1
Build with Macros:
GGML_SYCL_FORCE_MMQ: no
GGML_SYCL_F16: no
Found 1 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc A770 Graphics| 12.55| 512| 1024| 32| 16225M| 1.6.33578+11|
SYCL Optimization Feature:
|ID| Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]| Y|
llama_kv_cache_init: kv_size = 40960, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 36, can_shift = 1
llama_kv_cache_init: SYCL0 KV buffer size = 5760.00 MiB
llama_init_from_model: KV self size = 5760.00 MiB, K (f16): 2880.00 MiB, V (f16): 2880.00 MiB
llama_init_from_model: SYCL_Host output buffer size = 0.58 MiB
llama_init_from_model: SYCL0 compute buffer size = 1184.02 MiB
llama_init_from_model: SYCL_Host compute buffer size = 2438.00 MiB
llama_init_from_model: graph nodes = 1194
llama_init_from_model: graph splits = 105
common_init_from_params: setting dry_penalty_last_n to ctx_size = 40960
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 40960
main: model loaded
main: chat template, chat_template: {%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0].role == 'system' %}
{{- messages[0].content + '\n\n' }}
{%- endif %}
{{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
{%- if messages[0].role == 'system' %}
{{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for forward_message in messages %}
{%- set index = (messages|length - 1) - loop.index0 %}
{%- set message = messages[index] %}
{%- set current_content = message.content if message.content is defined and message.content is not none else '' %}
{%- set tool_start = '<tool_response>' %}
{%- set tool_start_length = tool_start|length %}
{%- set start_of_message = current_content[:tool_start_length] %}
{%- set tool_end = '</tool_response>' %}
{%- set tool_end_length = tool_end|length %}
{%- set start_pos = (current_content|length) - tool_end_length %}
{%- if start_pos < 0 %}
{%- set start_pos = 0 %}
{%- endif %}
{%- set end_of_message = current_content[start_pos:] %}
{%- if ns.multi_step_tool and message.role == "user" and not(start_of_message == tool_start and end_of_message == tool_end) %}
{%- set ns.multi_step_tool = false %}
{%- set ns.last_query_index = index %}
{%- endif %}
{%- endfor %}
{%- for message in messages %}
{%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
{{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
{%- elif message.role == "assistant" %}
{%- set m_content = message.content if message.content is defined and message.content is not none else '' %}
{%- set content = m_content %}
{%- set reasoning_content = '' %}
{%- if message.reasoning_content is defined and message.reasoning_content is not none %}
{%- set reasoning_content = message.reasoning_content %}
{%- else %}
{%- if '</think>' in m_content %}
{%- set content = (m_content.split('</think>')|last).lstrip('\n') %}
{%- set reasoning_content = (m_content.split('</think>')|first).rstrip('\n') %}
{%- set reasoning_content = (reasoning_content.split('<think>')|last).lstrip('\n') %}
{%- endif %}
{%- endif %}
{%- if loop.index0 > ns.last_query_index %}
{%- if loop.last or (not loop.last and (not reasoning_content.strip() == '')) %}
{{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- if message.tool_calls %}
{%- for tool_call in message.tool_calls %}
{%- if (loop.first and content) or (not loop.first) %}
{{- '\n' }}
{%- endif %}
{%- if tool_call.function %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '<tool_call>\n{"name": "' }}
{{- tool_call.name }}
{{- '", "arguments": ' }}
{%- if tool_call.arguments is string %}
{{- tool_call.arguments }}
{%- else %}
{{- tool_call.arguments | tojson }}
{%- endif %}
{{- '}\n</tool_call>' }}
{%- endfor %}
{%- endif %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "tool" %}
{%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
{{- '<|im_start|>user' }}
{%- endif %}
{{- '\n<tool_response>\n' }}
{{- message.content }}
{{- '\n</tool_response>' }}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- if enable_thinking is defined and enable_thinking is false %}
{{- '<think>\n\n</think>\n\n' }}
{%- endif %}
{%- endif %}, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://0.0.0.0:5812 - starting the main loop
srv update_slots: all slots are idle
srv log_server_r: request: GET /health 127.0.0.1 200
srv log_server_r: request: GET / 127.0.0.1 200
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 40960, n_keep = 0, n_prompt_tokens = 32
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 32, n_tokens = 32, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 32, n_tokens = 32
Hi @markussiebert ,
Performance on my machine
My test machine is an i9-13900K + A770, and I tested with this portable zip: https://github.com/ipex-llm/ipex-llm/releases/download/v2.3.0-nightly/llama-cpp-ipex-llm-2.3.0b20250612-ubuntu-core.tgz .
On this machine, I got about 30.9 tokens/s for DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf, 14.64 tokens/s for Qwen3-8B-UD-Q8_K_XL.gguf, and 56.84 tokens/s for Qwen3-8B-Q4_K_M.gguf.
Performance gap between the two Q8_K_XL.gguf files
Based on the output log, I found that Qwen3-8B-UD-Q8_K_XL.gguf has bf16 tensors while DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf has fp16 tensors. I guess this performance gap is caused by the fact that we do not have good support for bf16 tensors yet (a possible workaround is sketched after the tensor listing below). I may take a further look at this later; if there is any update, I will post it here to let you know.
# DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf
llama_model_loader: - type f32: 145 tensors
llama_model_loader: - type f16: 63 tensors
llama_model_loader: - type q8_0: 191 tensors
# Qwen3-8B-UD-Q8_K_XL.gguf
llama_model_loader: - type f32: 145 tensors
llama_model_loader: - type q8_0: 191 tensors
llama_model_loader: - type bf16: 63 tensors
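If the bf16 tensors are indeed the bottleneck, one workaround you could try (just a sketch: it trades a little precision, and assumes the llama-quantize binary shipped in the same package) is to requantize the model so the bf16 tensors become q8_0:

# hypothetical requantization; --allow-requantize permits converting already-quantized tensors
./llama-quantize --allow-requantize /mnt/models/unsloth/qwen3-8b-gguf/Qwen3-8B-UD-Q8_K_XL.gguf /mnt/models/unsloth/qwen3-8b-gguf/Qwen3-8B-Q8_0.gguf Q8_0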
But if you do not have a very specific precision requirement, I would recommend trying the Q4_K_M version; it should give better performance.
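For example, something like this (a sketch; the Q4_K_M file path is a placeholder for wherever you put the download):

./llama-server -m /mnt/models/unsloth/qwen3-8b-gguf/Qwen3-8B-Q4_K_M.gguf --port 5812 --host 0.0.0.0 --ctx-size 40960 --n-gpu-layers 99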
Performance gap between our tests
Hmm, actually I'm not very sure about this; I can just offer some possible reasons here:
- CPU difference
- Different input / output lengths: I only tested with a short input and output, but I see your output is 1565 tokens (a llama-bench sketch for a controlled comparison follows below).

What is your version of llama.cpp?
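To compare the length effect like-for-like, something similar to the following could be run on both machines (a sketch; -p / -n roughly match your 24-token prompt and 1565-token output, and the model path is the one from my machine):

export ONEAPI_DEVICE_SELECTOR=level_zero:0
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
./llama-bench -m /mnt/disk1/models/DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf -ngl 99 -p 24 -n 1565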
I run
llama-cpp-ipex-llm-2.3.0b20250612-ubuntu-core
so it is the same version you are using. My CPU is way older than yours - that might make a difference.