
Eval bug: llama.cpp returns gibberish on Intel Core Ultra 7 (155H) with ARC iGPU

Open cgruver opened this issue 10 months ago • 15 comments

Name and Version

llama-cli --version
version: 4784 (b95c8af3)
built with Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205) for x86_64-unknown-linux-gnu

Steps to Build

cat << EOF > /etc/yum.repos.d/oneAPI.repo
[oneAPI]
name=Intel® oneAPI repository
baseurl=https://yum.repos.intel.com/oneapi
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
EOF

dnf install -y procps-ng g++ cmake git libcurl-devel intel-oneapi-mkl-sycl-devel intel-oneapi-dnnl-devel intel-oneapi-compiler-dpcpp-cpp intel-level-zero oneapi-level-zero oneapi-level-zero-devel intel-compute-runtime ; \
    source /opt/intel/oneapi/setvars.sh ; \
    git clone https://github.com/ggerganov/llama.cpp.git -b ${LLAMA_CPP_VER} ; \
    cd llama.cpp ; \
    mkdir -p build ; \
    cd build ; \
    cmake .. -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_CURL=ON -DGGML_CCACHE=OFF -DGGML_NATIVE=OFF ; \
    cmake --build . --config Release -j -v ; \
    cmake --install . --prefix /llama-cpp ; \
    cd ../..
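
If it helps to rule out precision, the same tree can also be reconfigured with FP16 enabled in the SYCL backend (the log below reports GGML_SYCL_F16: no for this build); whether it affects this bug is untested:

cmake .. -DGGML_SYCL=ON -DGGML_SYCL_F16=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_CURL=ON -DGGML_CCACHE=OFF -DGGML_NATIVE=OFF
cmake --build . --config Release -j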

Operating systems

Linux

GGML backends

SYCL

Hardware

00:02.0 VGA compatible controller [0300]: Intel Corporation Meteor Lake-P [Intel Arc Graphics] [8086:7d55] (rev 08)
Found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                     Intel Arc Graphics|  12.71|    128|    1024|   32| 94111M|     1.5.30872.320000|
SYCL Optimization Feature:
|ID|        Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]|      Y|
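
For completeness, the device enumeration can be cross-checked with the oneAPI sycl-ls tool (bundled with the DPC++ compiler); exact output format varies by oneAPI version:

source /opt/intel/oneapi/setvars.sh
sycl-ls
# should list a [level_zero:gpu:0] entry matching the Intel Arc Graphics device above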

Models

granite3.1-moe:3b, granite3.1-dense:8b

Problem description & steps to reproduce

With --ngl 0 (CPU inference only), the model responds correctly:

llama-run --ngl 0 --jinja ${RAMALAMA_STORE}/models/ollama/granite3-moe:3b hello
Loading modelget_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
Hello! How can I assist you today?
With --ngl 999 (all layers offloaded to the iGPU), the same prompt produces gibberish:

llama-run --ngl 999 --jinja ${RAMALAMA_STORE}/models/ollama/granite3-moe:3b hello
Loading modelget_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
0

0

The answer is: 1

The answer is 1

The answer is: 1

The answer is: 1

The answer is: 1

The answer is: 1

The answer is: 1

The answer is: 1

The answer^C
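
Since --ngl 0 is clean and --ngl 999 is garbage, a next step could be sweeping the offload depth to find where the output first breaks (a sketch, not yet run; the model has 32 repeating layers plus the output layer):

for ngl in 1 8 16 24 32 33; do
    echo "=== --ngl $ngl ==="
    llama-run --ngl $ngl --jinja ${RAMALAMA_STORE}/models/ollama/granite3-moe:3b hello
done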

First Bad Commit

No response
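
Not yet bisected. If someone can point to a commit where SYCL offload still worked on this iGPU, git bisect could find the first bad one automatically. A hypothetical sketch: GOOD_COMMIT and $MODEL are placeholders, b95c8af3 is the known-bad build above, and the grep is only a crude pass/fail heuristic based on the good CPU output ("How can I assist you today?"):

# export MODEL=/path/to/granite3-moe.gguf first
git bisect start b95c8af3 GOOD_COMMIT
git bisect run sh -c '
  cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_NATIVE=OFF &&
  cmake --build build --config Release -j &&
  ./build/bin/llama-cli -m "$MODEL" -ngl 999 -p "hello" -n 32 2>/dev/null | grep -qi "assist"
'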

Relevant log output

llama-run --ngl 999 --jinja --verbose ${RAMALAMA_STORE}/models/ollama/granite3-moe:3b hello 


Loading modelget_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Arc(TM) Graphics) - 89752 MiB free
llama_model_loader: loaded meta data with 37 key-value pairs and 323 tensors from /model-dir/models/ollama/granite3-moe:3b (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = granitemoe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Granite 3.0 3b A800M Instruct
llama_model_loader: - kv   3:                           general.finetune str              = instruct
llama_model_loader: - kv   4:                           general.basename str              = granite-3.0
llama_model_loader: - kv   5:                         general.size_label str              = 3B-a800M
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                               general.tags arr[str,3]       = ["language", "granite-3.0", "text-gen...
llama_model_loader: - kv   8:                     granitemoe.block_count u32              = 32
llama_model_loader: - kv   9:                  granitemoe.context_length u32              = 4096
llama_model_loader: - kv  10:                granitemoe.embedding_length u32              = 1536
llama_model_loader: - kv  11:             granitemoe.feed_forward_length u32              = 512
llama_model_loader: - kv  12:            granitemoe.attention.head_count u32              = 24
llama_model_loader: - kv  13:         granitemoe.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                  granitemoe.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  15: granitemoe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  16:                    granitemoe.expert_count u32              = 40
llama_model_loader: - kv  17:               granitemoe.expert_used_count u32              = 8
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                      granitemoe.vocab_size u32              = 49155
llama_model_loader: - kv  20:            granitemoe.rope.dimension_count u32              = 64
llama_model_loader: - kv  21:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  22:                 granitemoe.attention.scale f32              = 0.015625
llama_model_loader: - kv  23:                 granitemoe.embedding_scale f32              = 12.000000
llama_model_loader: - kv  24:                  granitemoe.residual_scale f32              = 0.220000
llama_model_loader: - kv  25:                     granitemoe.logit_scale f32              = 6.000000
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = refact
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,49155]   = ["<|end_of_text|>", "<fim_prefix>", "...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,49155]   = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,48891]   = ["Ġ Ġ", "ĠĠ ĠĠ", "ĠĠĠĠ ĠĠ...
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 0
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  35:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|start_of_r...
llama_model_loader: - kv  36:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.92 GiB (4.88 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token:      2 '<fim_middle>' is not marked as EOG
load: control token:     13 '<jupyter_output>' is not marked as EOG
load: control token:      9 '<issue_closed>' is not marked as EOG
load: control token:      6 '<gh_stars>' is not marked as EOG
load: control token:     10 '<jupyter_start>' is not marked as EOG
load: control token:     14 '<empty_output>' is not marked as EOG
load: control token:     15 '<commit_before>' is not marked as EOG
load: control token:      5 '<filename>' is not marked as EOG
load: control token:     12 '<jupyter_code>' is not marked as EOG
load: control token:      4 '<fim_pad>' is not marked as EOG
load: control token:     18 '<reponame>' is not marked as EOG
load: control token:      7 '<issue_start>' is not marked as EOG
load: control token:      3 '<fim_suffix>' is not marked as EOG
load: control token:      1 '<fim_prefix>' is not marked as EOG
load: control token:      0 '<|end_of_text|>' is not marked as EOG
load: control token:      8 '<issue_comment>' is not marked as EOG
load: control token:     11 '<jupyter_text>' is not marked as EOG
load: control token:     16 '<commit_msg>' is not marked as EOG
load: control token:  49152 '<|start_of_role|>' is not marked as EOG
load: control token:  49154 '<|tool_call|>' is not marked as EOG
load: control token:  49153 '<|end_of_role|>' is not marked as EOG
load: control token:     17 '<commit_after>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.2826 MB
print_info: arch             = granitemoe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 4096
print_info: n_embd           = 1536
print_info: n_layer          = 32
print_info: n_head           = 24
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 3
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 6.0e+00
print_info: n_ff             = 512
print_info: n_expert         = 40
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 3B
print_info: model params     = 3.37 B
print_info: general.name     = Granite 3.0 3b A800M Instruct
print_info: f_embedding_scale = 12.000000
print_info: f_residual_scale  = 0.220000
print_info: f_attention_scale = 0.015625
print_info: vocab type       = BPE
print_info: n_vocab          = 49155
print_info: n_merges         = 48891
print_info: BOS token        = 0 '<|end_of_text|>'
print_info: EOS token        = 0 '<|end_of_text|>'
print_info: PAD token        = 0 '<|end_of_text|>'
print_info: LF token         = 203 'Ċ'
print_info: EOG token        = 0 '<|end_of_text|>'
print_info: max token length = 512
load_tensors: loading model tensors, this can take a while... (mmap = true)
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
load_tensors: layer   0 assigned to device SYCL0
load_tensors: layer   1 assigned to device SYCL0
load_tensors: layer   2 assigned to device SYCL0
load_tensors: layer   3 assigned to device SYCL0
load_tensors: layer   4 assigned to device SYCL0
load_tensors: layer   5 assigned to device SYCL0
load_tensors: layer   6 assigned to device SYCL0
load_tensors: layer   7 assigned to device SYCL0
load_tensors: layer   8 assigned to device SYCL0
load_tensors: layer   9 assigned to device SYCL0
load_tensors: layer  10 assigned to device SYCL0
load_tensors: layer  11 assigned to device SYCL0
load_tensors: layer  12 assigned to device SYCL0
load_tensors: layer  13 assigned to device SYCL0
load_tensors: layer  14 assigned to device SYCL0
load_tensors: layer  15 assigned to device SYCL0
load_tensors: layer  16 assigned to device SYCL0
load_tensors: layer  17 assigned to device SYCL0
load_tensors: layer  18 assigned to device SYCL0
load_tensors: layer  19 assigned to device SYCL0
load_tensors: layer  20 assigned to device SYCL0
load_tensors: layer  21 assigned to device SYCL0
load_tensors: layer  22 assigned to device SYCL0
load_tensors: layer  23 assigned to device SYCL0
load_tensors: layer  24 assigned to device SYCL0
load_tensors: layer  25 assigned to device SYCL0
load_tensors: layer  26 assigned to device SYCL0
load_tensors: layer  27 assigned to device SYCL0
load_tensors: layer  28 assigned to device SYCL0
load_tensors: layer  29 assigned to device SYCL0
load_tensors: layer  30 assigned to device SYCL0
load_tensors: layer  31 assigned to device SYCL0
load_tensors: layer  32 assigned to device SYCL0
load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:        SYCL0 model buffer size =  1921.79 MiB
load_tensors:   CPU_Mapped model buffer size =    40.50 MiB
.............................................................................................
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 2048
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (4096) -- the full capacity of the model will not be utilized
Running with Environment Variables:
  GGML_SYCL_DEBUG: 0
  GGML_SYCL_DISABLE_OPT: 0
Build with Macros:
  GGML_SYCL_FORCE_MMQ: no
  GGML_SYCL_F16: no
Found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                     Intel Arc Graphics|  12.71|    128|    1024|   32| 94111M|     1.5.30872.320000|
SYCL Optimization Feature:
|ID|        Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]|      Y|
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
llama_kv_cache_init: layer 0: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 1: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 2: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 3: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 4: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 5: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 6: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 7: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 8: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 9: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 10: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 11: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 12: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 13: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 14: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 15: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 16: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 17: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 18: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 19: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 20: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 21: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 22: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 23: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 24: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 25: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 26: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 27: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 28: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 29: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 30: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 31: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init:      SYCL0 KV buffer size =   128.00 MiB
llama_init_from_model: KV self size  =  128.00 MiB, K (f16):   64.00 MiB, V (f16):   64.00 MiB
llama_init_from_model:  SYCL_Host  output buffer size =     0.19 MiB
llama_init_from_model:      SYCL0 compute buffer size =   112.00 MiB
llama_init_from_model:  SYCL_Host compute buffer size =     7.01 MiB
llama_init_from_model: graph nodes  = 1960
llama_init_from_model: graph splits = 2
001261, "The Church of England", "001261"

"The Church of England" is a term that refers to the state church of England, which is the official church of the United Kingdom. It is also known as the "Church of England" or "the Anglican Church". The Church of England is a member of the worldwide Anglican Communion. It is headed by the King (or Queen) of England, who is also the head of state. The Church of England is also a member of the United Nations and the Commonwealth of Nations.

The Church of England has a long history and has been influential in the development of Christianity in England. It was established in the 6th century by King Constantine I, and has since been led by many notable figures, including St. Augustine, St. Bede, and King Henry VIII. The Church of England has also been a place of refuge for many people throughout history, including those who were persecuted for their faith.

Today, the Church of England is a vibrant and diverse community, with a wide range of worship services, ministries, and programs. It is also a place of
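
Two low-risk knobs for a follow-up run, both taken from the log itself (whether either changes the output is untested): the get_memory_info warning asks for ZES_ENABLE_SYSMAN=1, and the run header reports GGML_SYCL_DEBUG: 0.

export ZES_ENABLE_SYSMAN=1   # silences the ext_intel_free_memory warning, per the log message
export GGML_SYCL_DEBUG=1     # shown as 0 above; enables extra SYCL backend debug output
llama-run --ngl 999 --jinja --verbose ${RAMALAMA_STORE}/models/ollama/granite3-moe:3b hello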

cgruver · Feb 27 '25 15:02