
Eval bug: Incorrect outputs when running inference in multiple nodes (Mac)

Open jlcaam-bit opened this issue 2 months ago • 4 comments

Name and Version

b6791 and previous
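(For reference, the build string above is the one printed by the binaries themselves; assuming a standard build, it can be checked with:)

./llama-cli --version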

Operating systems

Mac

GGML backends

Metal

Hardware

Mac Mini M2 Pro + Mac Mini M1

Models

Multiple

Problem description & steps to reproduce

My setup has two different nodes: a Mac Mini M2 Pro and a Mac Mini M1. When running inference across both nodes, at some point the output starts repeating a single token until the end of the context window.

This happens only with multi-node inference; there is no issue on a single node.

Happens with different models.

It happens with both llama-server and llama-cli (plus rpc-server on the remote node).
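For completeness, a llama-server launch over RPC looks similar to the llama-cli examples below; a rough sketch (the --rpc endpoint matches those examples, and the port is an arbitrary choice for illustration):

./llama-server --model /var/lib/modelostemp/gpt-oss-20b-mxfp4.gguf --ctx-size 7000 --rpc 10.1.2.2:50052 --port 8080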

The bug is present in all builds starting at b6473; b6471 and earlier are fine.
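For anyone who wants to re-check the regression window, it can be narrowed with a bisect between the two release tags (a sketch, assuming a local llama.cpp checkout with the b-prefixed release tags fetched):

git fetch --tags
git bisect start b6473 b6471    # first the known-bad tag, then the known-good one
# at each step: rebuild, rerun the dual-node repro from the Examples section,
# then mark the build accordingly:
git bisect good     # or: git bisect bad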

Examples

Running dual-node, the result is wrong:

./rpc-server -H 10.1.2.2:50052 (on the second machine)

./llama-cli --model /var/lib/modelostemp/gpt-oss-20b-mxfp4.gguf --temp 1 --top_k 20 --top_p 1.0 --min_p 0 --seed 1234 --jinja --ctx-size 7000 --no-mmap --rpc 10.1.2.2:50052


> tell me about Julio Cesar
<|channel|>analysis<|message|>User says: "tell me about Julio Cesar". This is ambiguous: could refer to historical Julius (Roman) or other person named Julio Cesar e.g., footballer. Usually people ask general about historical. Need to ask clarifying? The instruction: "If you are unsure what user is asking for... ask for clarification." Here unclear: Julio Cesar could be the Emperor? It might also be the Brazilian footballer? Could also be the Spanish singer? Let's see: "Julio César" with accent; many people. In Brazil, famous footballers: Julio Cesar de Souza Júnior (born 1985), Brazilian goalkeeper. There's also "Julio César" the singer? Or "Julio César" in Spanish maybe meaning "Julius Caesar" but spelled Spanish. Likely user wants info about the Roman general Julius Caesar. But the name is spelled Julio with J. In Spanish/Portuguese, Julius Caesar is Julio César. So user might want about Roman. But there is also the footballer. Without clarifying, risk misinterpretation. The instruction: if uncertain, ask for clarification. So I should ask: specify which Julio Cesar.<|end|><|start|>assistant<|channel|>final<|message|>I’m happy to help—but could you specify which Julio César you’re interested in?  
- The Roman general/ statesman (often written “Julius Caesar” in English).  
- The Brazilian football goalkeeper Julio César de Souza Júnior (born 1985).  
- Another public figure named Julio César (e.g., an artist, politician, etc.).

Just let me know which one (or if you’d like information on all of them!) so I can give you the most accurate and useful details.

> the emperor
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
(omitted)
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

main: context full and context shift is disabled => stopping

Running on a single node, the result is okay:

./llama-cli --model /var/lib/modelostemp/gpt-oss-20b-mxfp4.gguf --temp 1 --top_k 20 --top_p 1.0 --min_p 0 --seed 1234 --jinja --ctx-size 7000 --no-mmap --n-cpu-moe 1

This command does not show the problem.

Same problem with other models:

./llama-cli --model /var/lib/modelostemp/Qwen3-4B-2507/Qwen3-4B-Instruct-2507-Q8_0.gguf --temp 0 --top_k 20 --top_p 0.95 --seed 12345 --ctx-size 4000 --rpc 10.1.2.2:50052

> tell me about Julio Cesar
It seems like you might be referring to **Julio César**, but there are a few notable people with that name. Here are the most prominent ones:

---

### 1. **Julio César (the Roman Emperor)**
This is likely the most historically significant figure named Julio César.

- **Full Name**: Gaius Julius Caesar (commonly known as Julius Caesar)
- **Lived**: 100 BCE – 44 BCE
- **Role**: Roman statesman, general, and dictator
- **Key Facts**:
  - One of the most influential figures in Roman history.
  - Expanded the Roman Republic through military conquests (e.g., Gaul).
  - Introduced major reforms in law, calendar (Julian calendar), and governance.
  - Assassinated in 44 BCE by a group of senators who feared his growing power.
  - His assassination led to the rise of the Roman Empire and eventually Augustus (Octavian).

> Note: The name "Julio César" is a Spanish/Portuguese version of "Julius Caesar".

---

### 2. **Julio César (the boxer)000000000000000000000000000000000
(omitted)
00000000000000000000000000000000000000000000000000000

main: context full and context shift is disabled => stopping

./llama-cli --model /var/lib/modelostemp/DeepSeek-R1-0528-Qwen3-8B-Q8_0.gguf --flash_attn on --ctx-size 4000 --rpc 10.1.2.2:50052

> tell me about Julio Cesar
<think>
Okay, the user asked "tell me about Julio Cesar". I think they meant Julius Caesar, the famous Roman general and statesman. They probably want a general overview, but I should consider what level of detail they might need.

Let me start by outlining the key points about Julius Caesar. He was a pivotal figure in Roman history, known for his military conquests and political reforms. I need to structure this information clearly but also engagingly. Since the user might be a student or someone with casual interest, I should include both historical significance and interesting anecdotes to keep it captivating.

The user might not just want facts; they could be interested in understanding why Julius Caesar is still relevant today. So, highlighting his lasting legacies like the Julian calendar and his influence on Western literature through "The Brutus" would be useful. Also, his assassination is a major event, so explaining its causes and consequences is necessary.

I should check if the user has any specific interests. Maybe they’re researching for a project or just curious. Either way, providing a well-rounded summary covering his life, achievements, and end is best. Avoiding too much detail unless asked, keeping it concise but informative. 

Including his rise to power, reforms@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

First Bad Commit

b6473

Relevant log output

x

jlcaam-bit · Oct 18 '25, 23:10

I can confirm: on macOS 26.0.1 I run into the same problem.

RPC server log

chodorenko@Mikhails-MacBook-Air bin % ./rpc-server -d Metal -c -H 169.254.97.200

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! WARNING: Host ('169.254.97.200') is != '127.0.0.1' Never expose the RPC server to an open network! This is an experimental feature and is not secure! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.018 sec
ggml_metal_device_init: GPU name: Apple M1
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 12713.12 MB
Starting RPC server v3.0.0
  endpoint    : 169.254.97.200:50052
  local cache : /Users/chodorenko/Library/Caches/llama.cpp/rpc/
Devices:
  Metal: Apple M1 (12124 MiB, 12123 MiB free)
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1
ggml_metal_init: picking default device: Apple M1
ggml_metal_init: use bfloat = true
ggml_metal_init: use fusion = true
ggml_metal_init: use concurrency = true
ggml_metal_init: use graph optimize = true
Accepted client connection
Client connection closed
(the two lines above repeat many times)
Accepted client connection
Null buffer for tensor passed to init_tensor function
(the line above repeats many times)
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rms_norm_f32_4', name = 'kernel_rms_norm_f32_4'
ggml_metal_library_compile_pipeline: loaded kernel_rms_norm_f32_4 0x1054c7990 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_row_c4_fuse_1', name = 'kernel_mul_row_c4_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_mul_row_c4_fuse_1 0x1054c84d0 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_q8_0_f32', name = 'kernel_mul_mv_q8_0_f32_nsg=4'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_q8_0_f32_nsg=4 0x1054c92d0 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_row_c4_fuse_1', name = 'kernel_add_row_c4_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_add_row_c4_fuse_1 0x1054c9c10 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rope_neox_f32', name = 'kernel_rope_neox_f32'
ggml_metal_library_compile_pipeline: loaded kernel_rope_neox_f32 0x1054ca410 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_set_rows_f16_i64', name = 'kernel_set_rows_f16_i64'
ggml_metal_library_compile_pipeline: loaded kernel_set_rows_f16_i64 0x1054ca8d0 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_cpy_f32_f16', name = 'kernel_cpy_f32_f16'
ggml_metal_library_compile_pipeline: loaded kernel_cpy_f32_f16 0x1054cad90 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_flash_attn_ext_vec_f16_dk64_dv64', name = 'kernel_flash_attn_ext_vec_f16_dk64_dv64_mask=1_sink=1_bias=0_scap=0_kvpad=0_ns10=512_ns20=512_nsg=1_nwg=32'
ggml_metal_library_compile_pipeline: loaded kernel_flash_attn_ext_vec_f16_dk64_dv64_mask=1_sink=1_bias=0_scap=0_kvpad=0_ns10=512_ns20=512_nsg=1_nwg=32 0x1054cb390 | th_max = 768 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_flash_attn_ext_vec_reduce', name = 'kernel_flash_attn_ext_vec_reduce_dv=64_nwg=32'
ggml_metal_library_compile_pipeline: loaded kernel_flash_attn_ext_vec_reduce_dv=64_nwg=32 0xbfede8000 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_ext_q8_0_f32_r1_2', name = 'kernel_mul_mv_ext_q8_0_f32_r1_2_nsg=2_nxpsg=16'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_ext_q8_0_f32_r1_2_nsg=2_nxpsg=16 0xbfede8300 | th_max = 832 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_fuse_1', name = 'kernel_add_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_add_fuse_1 0xbfede8600 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_f32_f32_4', name = 'kernel_mul_mv_f32_f32_4_nsg=4'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_f32_f32_4_nsg=4 0xbfede8900 | th_max = 768 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_argsort_f32_i32_desc', name = 'kernel_argsort_f32_i32_desc'
ggml_metal_library_compile_pipeline: loaded kernel_argsort_f32_i32_desc 0xbfede8c00 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_get_rows_f32', name = 'kernel_get_rows_f32'
ggml_metal_library_compile_pipeline: loaded kernel_get_rows_f32 0xbfede8f00 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_soft_max_f32_4', name = 'kernel_soft_max_f32_4'
ggml_metal_library_compile_pipeline: loaded kernel_soft_max_f32_4 0xbfede9200 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_id_mxfp4_f32', name = 'kernel_mul_mv_id_mxfp4_f32_nsg=2'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_id_mxfp4_f32_nsg=2 0xbfede9500 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_id', name = 'kernel_add_id'
ggml_metal_library_compile_pipeline: loaded kernel_add_id 0xbfede9800 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_swiglu_oai_f32', name = 'kernel_swiglu_oai_f32'
ggml_metal_library_compile_pipeline: loaded kernel_swiglu_oai_f32 0xbfede9b00 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_fuse_1', name = 'kernel_mul_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_mul_fuse_1 0xbfede9e00 | th_max = 1024 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mm_q8_0_f32', name = 'kernel_mul_mm_q8_0_f32_bci=0_bco=1'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mm_q8_0_f32_bci=0_bco=1 0xbfedea100 | th_max = 896 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mm_f32_f32', name = 'kernel_mul_mm_f32_f32_bci=0_bco=1'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mm_f32_f32_bci=0_bco=1 0xbfedea400 | th_max = 832 | th_width = 32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_flash_attn_ext_vec_f16_dk64_dv64', name = 'kernel_flash_attn_ext_vec_f16_dk64_dv64_mask=1_sink=1_bias=0_scap=0_kvpad=0_ns10=512_ns20=512_nsg=2_nwg=32'
ggml_metal_library_compile_pipeline: loaded kernel_flash_attn_ext_vec_f16_dk64_dv64_mask=1_sink=1_bias=0_scap=0_kvpad=0_ns10=512_ns20=512_nsg=2_nwg=32 0xbfedea700 | th_max = 768 | th_width = 32
Client connection closed

llama-cli log

chodorenko@Chodorenko-M15-2 bin % ./llama-cli --model ~/18/gpt-oss-20b-Q8_0.gguf --rpc 169.254.97.200:50052
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.049 sec
ggml_metal_device_init: GPU name: Apple M3
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 19069.67 MB
build: 6865 (1c1409e13) with Apple clang version 17.0.0 (clang-1700.3.19.1) for x86_64-apple-darwin25.0.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device RPC0 (169.254.97.200:50052) (unknown id) - 12123 MiB free
llama_model_load_from_file_impl: using device Metal (Apple M3) (unknown id) - 18185 MiB free
llama_model_loader: loaded meta data with 37 key-value pairs and 459 tensors from /Users/chodorenko/18/gpt-oss-20b-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gpt-oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gpt-Oss-20B
llama_model_loader: - kv 3: general.basename str = Gpt-Oss-20B
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
llama_model_loader: - kv 5: general.size_label str = 20B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: general.tags arr[str,2] = ["vllm", "text-generation"]
llama_model_loader: - kv 9: gpt-oss.block_count u32 = 24
llama_model_loader: - kv 10: gpt-oss.context_length u32 = 131072
llama_model_loader: - kv 11: gpt-oss.embedding_length u32 = 2880
llama_model_loader: - kv 12: gpt-oss.feed_forward_length u32 = 2880
llama_model_loader: - kv 13: gpt-oss.attention.head_count u32 = 64
llama_model_loader: - kv 14: gpt-oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: gpt-oss.rope.freq_base f32 = 150000,000000
llama_model_loader: - kv 16: gpt-oss.attention.layer_norm_rms_epsilon f32 = 0,000010
llama_model_loader: - kv 17: gpt-oss.expert_count u32 = 32
llama_model_loader: - kv 18: gpt-oss.expert_used_count u32 = 4
llama_model_loader: - kv 19: gpt-oss.attention.key_length u32 = 64
llama_model_loader: - kv 20: gpt-oss.attention.value_length u32 = 64
llama_model_loader: - kv 21: gpt-oss.attention.sliding_window u32 = 128
llama_model_loader: - kv 22: gpt-oss.expert_feed_forward_length u32 = 2880
llama_model_loader: - kv 23: gpt-oss.rope.scaling.type str = yarn
llama_model_loader: - kv 24: gpt-oss.rope.scaling.factor f32 = 32,000000
llama_model_loader: - kv 25: gpt-oss.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 27: tokenizer.ggml.pre str = gpt-4o
llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,201088] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,201088] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,446189] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 199998
llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 200002
llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 200017
llama_model_loader: - kv 34: tokenizer.chat_template str = {# Chat template fixes by Unsloth #}\n...
llama_model_loader: - kv 35: general.quantization_version u32 = 2
llama_model_loader: - kv 36: general.file_type u32 = 7
llama_model_loader: - type f32: 289 tensors
llama_model_loader: - type q8_0: 98 tensors
llama_model_loader: - type mxfp4: 72 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 11,27 GiB (4,63 BPW)
load: printing all EOG tokens:
load: - 199999 ('<|endoftext|>')
load: - 200002 ('<|return|>')
load: - 200007 ('<|end|>')
load: - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>' tokens, removing '<|end|>' token from EOG list
load: special tokens cache size = 21
load: token to piece cache size = 1,3332 MB
print_info: arch = gpt-oss
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 2880
print_info: n_layer = 24
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 128
print_info: is_swa_any = 1
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0,0e+00
print_info: f_norm_rms_eps = 1,0e-05
print_info: f_clamp_kqv = 0,0e+00
print_info: f_max_alibi_bias = 0,0e+00
print_info: f_logit_scale = 0,0e+00
print_info: f_attn_scale = 0,0e+00
print_info: n_ff = 2880
print_info: n_expert = 32
print_info: n_expert_used = 4
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = yarn
print_info: freq_base_train = 150000,0
print_info: freq_scale_train = 0,03125
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: model type = 20B
print_info: model params = 20,91 B
print_info: general.name = Gpt-Oss-20B
print_info: n_ff_exp = 2880
print_info: vocab type = BPE
print_info: n_vocab = 201088
print_info: n_merges = 446189
print_info: BOS token = 199998 '<|startoftext|>'
print_info: EOS token = 200002 '<|return|>'
print_info: EOT token = 200007 '<|end|>'
print_info: PAD token = 200017 '<|reserved_200017|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 199999 '<|endoftext|>'
print_info: EOG token = 200002 '<|return|>'
print_info: EOG token = 200012 '<|call|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors: CPU_Mapped model buffer size = 586,82 MiB
load_tensors: Metal_Mapped model buffer size = 11536,18 MiB
load_tensors: RPC0[169.254.97.200:50052] model buffer size = 4317,72 MiB
.................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 150000,0
llama_context: freq_scale = 0,03125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3
ggml_metal_init: picking default device: Apple M3
ggml_metal_init: use bfloat = true
ggml_metal_init: use fusion = true
ggml_metal_init: use concurrency = true
ggml_metal_init: use graph optimize = true
llama_context: CPU output buffer size = 0,77 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache: Metal KV buffer size = 56,00 MiB
llama_kv_cache: RPC0[169.254.97.200:50052] KV buffer size = 40,00 MiB
llama_kv_cache: size = 96,00 MiB ( 4096 cells, 12 layers, 1/1 seqs), K (f16): 48,00 MiB, V (f16): 48,00 MiB
llama_kv_cache_iswa: creating SWA KV cache, size = 768 cells
llama_kv_cache: Metal KV buffer size = 10,50 MiB
llama_kv_cache: RPC0[169.254.97.200:50052] KV buffer size = 7,50 MiB
llama_kv_cache: size = 18,00 MiB ( 768 cells, 12 layers, 1/1 seqs), K (f16): 9,00 MiB, V (f16): 9,00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: RPC0[169.254.97.200:50052] compute buffer size = 97,02 MiB
llama_context: Metal compute buffer size = 398,38 MiB
llama_context: CPU compute buffer size = 15,15 MiB
llama_context: graph nodes = 1352
llama_context: graph splits = 3
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|return|> logit bias = -inf
common_init_from_params: added <|call|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 4
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<|start|>system<|message|>You are a helpful assistant<|end|><|start|>user<|message|>Hello<|end|><|start|>assistant<|message|>Hi there<|return|><|start|>user<|message|>How are you?<|end|><|start|>assistant

system_info: n_threads = 4 (n_threads_batch = 4) / 8 | Metal : EMBED_LIBRARY = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | REPACK = 1 |

main: interactive mode on.
sampler seed: 2315456133
sampler params:
	repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
	dry_multiplier = 0,000, dry_base = 1,750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0,950, min_p = 0,050, xtc_probability = 0,000, xtc_threshold = 0,100, typical_p = 1,000, top_n_sigma = -1,000, temp = 0,800
	mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0

== Running in interactive mode. ==

  • Press Ctrl+C to interject at any time.
  • Press Return to return control to the AI.
  • To return control without starting a new line, end your input with '/'.
  • If you want to submit another line, end your input with '\'.
  • Not using system message. To change it, set a different value via -sys PROMPT

> let me history King Arthur
<|channel|>analysis<|message|>User: "let me history King Arthur". They likely want a brief history. We can give overview.<|end|><|start|>assistant<|channel|>final<|message|>King Arthur—A Mix of History, Myth, and Legend

Time Key Events Sources What We Know
5th‑6th c. AD Britain is shaken by the Anglo‑Saxon invasions. Local leaders defend against Saxon raids. Gildas (c. 6 c.), Historia Brittonum (c. 8 c.) No concrete record of a single “Arthur.” The era’s chaos produced many local war‑lords.
c. 530–560 AD The Eddas mentions “Arthurus” in a Latin account of the Battle of the Quin? Early Middle Ages (c. 530 c.) This is a fictitious mention.
7th‑8th c. AD Arthurus is Arcturus – the Arthurian legend. *Arthurian@@@@@@@@@@@@@@@@@@@@@@@@
(omitted)
@@@@@@@@@@@@@@@@@@@@@@@@

main: context full and context shift is disabled => stopping

llama_perf_sampler_print: sampling time = 1040,03 ms / 4096 runs ( 0,25 ms per token, 3938,34 tokens per second)
llama_perf_context_print: load time = 17781,43 ms
llama_perf_context_print: prompt eval time = 751,50 ms / 11 tokens ( 68,32 ms per token, 14,64 tokens per second)
llama_perf_context_print: eval time = 369494,66 ms / 4084 runs ( 90,47 ms per token, 11,05 tokens per second)
llama_perf_context_print: total time = 394866,23 ms / 4095 tokens
llama_perf_context_print: graphs reused = 4068
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - RPC0 (169.254.97.200:50052) | 12124 = 7661 + ( 4462 = 4317 + 47 + 97) + 0 |
llama_memory_breakdown_print: | - Metal (Apple M3) | 18186 = 2992 + (12001 = 11536 + 66 + 398) + 3192 |
llama_memory_breakdown_print: | - Host | 601 = 586 + 0 + 15 |
ggml_metal_free: deallocating

chodorenko · Oct 28 '25, 16:10

@rgerganov @slaren

Maybe you can confirm the bug and fix it?

chodorenko · Oct 29 '25, 13:10

I think this is a known issue with Metal + RPC (https://github.com/ggml-org/llama.cpp/pull/16276#pullrequestreview-3287676108). Can you confirm that adding -fa off fixes the issue?
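For example, applied to the reproduction command from the original report, the workaround would look like this (a sketch; -fa off forces flash attention off instead of the auto default shown in the logs above):

./rpc-server -H 10.1.2.2:50052        # on the second machine, as before
./llama-cli --model /var/lib/modelostemp/gpt-oss-20b-mxfp4.gguf --ctx-size 7000 --rpc 10.1.2.2:50052 -fa off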

ggerganov · Oct 29 '25, 14:10

I think this is a known issue with Metal + RPC (#16276 (review)). Can you confirm that adding -fa off fixes the issue?

Big thanks, this parameter fixes the error.

chodorenko · Oct 29 '25, 15:10