Eval bug: no generation on follow-up after high-token responses on GPT-OSS 120B
Name and Version
./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Radeon PRO W7900, gfx1100 (0x1100), VMM: no, Wave Size: 32
  Device 1: AMD Radeon PRO W7900, gfx1100 (0x1100), VMM: no, Wave Size: 32
version: 6250 (e92734d5)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
HIP
Hardware
2x Radeon Pro W7900
Models
ggml-org/gpt-oss-120b-GGUF
Problem description & steps to reproduce
When asking follow-up questions in long conversations (e.g., ~15k-token prompts), the server stays stuck on "SWA checkpoint create, pos_min = x, pos_max = x, size = x MiB, total = x/3 (x MiB)" with no generation. Reproduce as follows:
Ask a long coding question, receive a ~10k+ token response, then send any follow-up message; no response is generated.
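A minimal reproduction client might look like the sketch below (the port, URL, and prompt text are placeholders I chose, not taken from the report; any OpenAI-compatible client such as OpenWebUI sends the equivalent second-turn request):

```python
import json
import urllib.request

# Placeholder endpoint; the report ran llama-server behind OpenWebUI.
URL = "http://localhost:8003/v1/chat/completions"

def build_followup(question: str, long_answer: str, followup: str) -> dict:
    """Second-turn request whose history carries the ~10k-token first answer."""
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": long_answer},
            {"role": "user", "content": followup},
        ],
        "stream": True,
    }

payload = build_followup("Write a large program ...",
                         "<~10k-token answer from turn 1>",
                         "Now add feature X.")
req = urllib.request.Request(URL, data=json.dumps(payload).encode("utf-8"),
                             headers={"Content-Type": "application/json"})
# urllib.request.urlopen(req)  # on affected setups this hangs after
#                              # "SWA checkpoint create" with no tokens streamed
```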
First Bad Commit
No response
Relevant log output
srv load_model: loading model '/home/ultimis/LLM/Models/ggml-org/gpt-oss-120b-GGUF/gpt-oss-120b-mxfp4-00001-of-00003.gguf'
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon PRO W7900) - 49040 MiB free
llama_model_load_from_file_impl: using device ROCm1 (AMD Radeon PRO W7900) - 49040 MiB free
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 38 key-value pairs and 687 tensors from /home/ultimis/LLM/Models/ggml-org/gpt-oss-120b-GGUF/gpt-oss-120b-mxfp4-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gpt-oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gpt Oss 120b
llama_model_loader: - kv 3: general.basename str = gpt-oss
llama_model_loader: - kv 4: general.size_label str = 120B
llama_model_loader: - kv 5: general.license str = apache-2.0
llama_model_loader: - kv 6: general.tags arr[str,2] = ["vllm", "text-generation"]
llama_model_loader: - kv 7: gpt-oss.block_count u32 = 36
llama_model_loader: - kv 8: gpt-oss.context_length u32 = 131072
llama_model_loader: - kv 9: gpt-oss.embedding_length u32 = 2880
llama_model_loader: - kv 10: gpt-oss.feed_forward_length u32 = 2880
llama_model_loader: - kv 11: gpt-oss.attention.head_count u32 = 64
llama_model_loader: - kv 12: gpt-oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: gpt-oss.rope.freq_base f32 = 150000.000000
llama_model_loader: - kv 14: gpt-oss.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 15: gpt-oss.expert_count u32 = 128
llama_model_loader: - kv 16: gpt-oss.expert_used_count u32 = 4
llama_model_loader: - kv 17: gpt-oss.attention.key_length u32 = 64
llama_model_loader: - kv 18: gpt-oss.attention.value_length u32 = 64
llama_model_loader: - kv 19: gpt-oss.attention.sliding_window u32 = 128
llama_model_loader: - kv 20: gpt-oss.expert_feed_forward_length u32 = 2880
llama_model_loader: - kv 21: gpt-oss.rope.scaling.type str = yarn
llama_model_loader: - kv 22: gpt-oss.rope.scaling.factor f32 = 32.000000
llama_model_loader: - kv 23: gpt-oss.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 25: tokenizer.ggml.pre str = gpt-4o
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,201088] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,201088] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,446189] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 199998
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 200002
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 199999
llama_model_loader: - kv 32: tokenizer.chat_template str = {#-\n In addition to the normal input...
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - kv 34: general.file_type u32 = 38
llama_model_loader: - kv 35: split.no u16 = 0
llama_model_loader: - kv 36: split.tensors.count i32 = 687
llama_model_loader: - kv 37: split.count u16 = 3
llama_model_loader: - type f32: 433 tensors
llama_model_loader: - type q8_0: 146 tensors
llama_model_loader: - type mxfp4: 108 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 59.02 GiB (4.34 BPW)
load: printing all EOG tokens:
load: - 199999 ('<|endoftext|>')
load: - 200002 ('<|return|>')
load: - 200007 ('<|end|>')
load: - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>' tokens, removing '<|end|>' token from EOG list
load: special tokens cache size = 21
load: token to piece cache size = 1.3332 MB
print_info: arch = gpt-oss
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 2880
print_info: n_layer = 36
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 128
print_info: is_swa_any = 1
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 2880
print_info: n_expert = 128
print_info: n_expert_used = 4
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = yarn
print_info: freq_base_train = 150000.0
print_info: freq_scale_train = 0.03125
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: model type = 120B
print_info: model params = 116.83 B
print_info: general.name = Gpt Oss 120b
print_info: n_ff_exp = 2880
print_info: vocab type = BPE
print_info: n_vocab = 201088
print_info: n_merges = 446189
print_info: BOS token = 199998 '<|startoftext|>'
print_info: EOS token = 200002 '<|return|>'
print_info: EOT token = 199999 '<|endoftext|>'
print_info: PAD token = 199999 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 199999 '<|endoftext|>'
print_info: EOG token = 200002 '<|return|>'
print_info: EOG token = 200012 '<|call|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
srv log_server_r: request: GET /health 127.0.0.1 503
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: ROCm0 model buffer size = 31278.67 MiB
load_tensors: ROCm1 model buffer size = 28573.01 MiB
load_tensors: CPU_Mapped model buffer size = 586.82 MiB
................................................................srv log_server_r: request: GET /health 127.0.0.1 503
....................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 131072
llama_context: n_ctx_per_seq = 131072
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = false
llama_context: freq_base = 150000.0
llama_context: freq_scale = 0.03125
llama_context: ROCm_Host output buffer size = 0.77 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 131072 cells
llama_kv_cache: ROCm0 KV buffer size = 2304.00 MiB
llama_kv_cache: ROCm1 KV buffer size = 2304.00 MiB
llama_kv_cache: size = 4608.00 MiB (131072 cells, 18 layers, 1/1 seqs), K (f16): 2304.00 MiB, V (f16): 2304.00 MiB
llama_kv_cache_iswa: creating SWA KV cache, size = 2304 cells
llama_kv_cache: ROCm0 KV buffer size = 45.00 MiB
llama_kv_cache: ROCm1 KV buffer size = 36.00 MiB
llama_kv_cache: size = 81.00 MiB ( 2304 cells, 18 layers, 1/1 seqs), K (f16): 40.50 MiB, V (f16): 40.50 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
srv log_server_r: request: GET /health 127.0.0.1 503
llama_context: ROCm0 compute buffer size = 5158.28 MiB
llama_context: ROCm1 compute buffer size = 3767.81 MiB
llama_context: ROCm_Host compute buffer size = 4190.84 MiB
llama_context: graph nodes = 2024
llama_context: graph splits = 3
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|return|> logit bias = -inf
common_init_from_params: added <|call|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 131072
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 131072
main: model loaded
main: chat template, chat_template: {#-
In addition to the normal inputs of `messages` and `tools`, this template also accepts the
following kwargs:
- "builtin_tools": A list, can contain "browser" and/or "python".
- "model_identity": A string that optionally describes the model identity.
- "reasoning_effort": A string that describes the reasoning effort, defaults to "medium".
#}
{#- Tool Definition Rendering ============================================== #}
{%- macro render_typescript_type(param_spec, required_params, is_nullable=false) -%}
{%- if param_spec.type == "array" -%}
{%- if param_spec['items'] -%}
{%- if param_spec['items']['type'] == "string" -%}
{{- "string[]" }}
{%- elif param_spec['items']['type'] == "number" -%}
{{- "number[]" }}
{%- elif param_spec['items']['type'] == "integer" -%}
{{- "number[]" }}
{%- elif param_spec['items']['type'] == "boolean" -%}
{{- "boolean[]" }}
{%- else -%}
{%- set inner_type = render_typescript_type(param_spec['items'], required_params) -%}
{%- if inner_type == "object | object" or inner_type|length > 50 -%}
{{- "any[]" }}
{%- else -%}
{{- inner_type + "[]" }}
{%- endif -%}
{%- endif -%}
{%- if param_spec.nullable -%}
{{- " | null" }}
{%- endif -%}
{%- else -%}
{{- "any[]" }}
{%- if param_spec.nullable -%}
{{- " | null" }}
{%- endif -%}
{%- endif -%}
{%- elif param_spec.type is defined and param_spec.type is iterable and param_spec.type is not string and param_spec.type is not mapping and param_spec.type[0] is defined -%}
{#- Handle array of types like ["object", "object"] from Union[dict, list] #}
{%- if param_spec.type | length > 1 -%}
{{- param_spec.type | join(" | ") }}
{%- else -%}
{{- param_spec.type[0] }}
{%- endif -%}
{%- elif param_spec.oneOf -%}
{#- Handle oneOf schemas - check for complex unions and fallback to any #}
{%- set has_object_variants = false -%}
{%- for variant in param_spec.oneOf -%}
{%- if variant.type == "object" -%}
{%- set has_object_variants = true -%}
{%- endif -%}
{%- endfor -%}
{%- if has_object_variants and param_spec.oneOf|length > 1 -%}
{{- "any" }}
{%- else -%}
{%- for variant in param_spec.oneOf -%}
{{- render_typescript_type(variant, required_params) -}}
{%- if variant.description %}
{{- "// " + variant.description }}
{%- endif -%}
{%- if variant.default is defined %}
{{ "// default: " + variant.default|tojson }}
{%- endif -%}
{%- if not loop.last %}
{{- " | " }}
{% endif -%}
{%- endfor -%}
{%- endif -%}
{%- elif param_spec.type == "string" -%}
{%- if param_spec.enum -%}
{{- '"' + param_spec.enum|join('" | "') + '"' -}}
{%- else -%}
{{- "string" }}
{%- if param_spec.nullable %}
{{- " | null" }}
{%- endif -%}
{%- endif -%}
{%- elif param_spec.type == "number" -%}
{{- "number" }}
{%- elif param_spec.type == "integer" -%}
{{- "number" }}
{%- elif param_spec.type == "boolean" -%}
{{- "boolean" }}
{%- elif param_spec.type == "object" -%}
{%- if param_spec.properties -%}
{{- "{\n" }}
{%- for prop_name, prop_spec in param_spec.properties.items() -%}
{{- prop_name -}}
{%- if prop_name not in (param_spec.required or []) -%}
{{- "?" }}
{%- endif -%}
{{- ": " }}
{{ render_typescript_type(prop_spec, param_spec.required or []) }}
{%- if not loop.last -%}
{{-", " }}
{%- endif -%}
{%- endfor -%}
{{- "}" }}
{%- else -%}
{{- "object" }}
{%- endif -%}
{%- else -%}
{{- "any" }}
{%- endif -%}
{%- endmacro -%}
{%- macro render_tool_namespace(namespace_name, tools) -%}
{{- "## " + namespace_name + "\n\n" }}
{{- "namespace " + namespace_name + " {\n\n" }}
{%- for tool in tools %}
{%- set tool = tool.function %}
{{- "// " + tool.description + "\n" }}
{{- "type "+ tool.name + " = " }}
{%- if tool.parameters and tool.parameters.properties %}
{{- "(_: {\n" }}
{%- for param_name, param_spec in tool.parameters.properties.items() %}
{%- if param_spec.description %}
{{- "// " + param_spec.description + "\n" }}
{%- endif %}
{{- param_name }}
{%- if param_name not in (tool.parameters.required or []) -%}
{{- "?" }}
{%- endif -%}
{{- ": " }}
{{- render_typescript_type(param_spec, tool.parameters.required or []) }}
{%- if param_spec.default is defined -%}
{%- if param_spec.enum %}
{{- ", // default: " + param_spec.default }}
{%- elif param_spec.oneOf %}
{{- "// default: " + param_spec.default }}
{%- else %}
{{- ", // default: " + param_spec.default|tojson }}
{%- endif -%}
{%- endif -%}
{%- if not loop.last %}
{{- ",\n" }}
{%- else %}
{{- ",\n" }}
{%- endif -%}
{%- endfor %}
{{- "}) => any;\n\n" }}
{%- else -%}
{{- "() => any;\n\n" }}
{%- endif -%}
{%- endfor %}
{{- "} // namespace " + namespace_name }}
{%- endmacro -%}
{%- macro render_builtin_tools(browser_tool, python_tool) -%}
{%- if browser_tool %}
{{- "## browser\n\n" }}
{{- "// Tool for browsing.\n" }}
{{- "// The `cursor` appears in brackets before each browsing display: `[{cursor}]`.\n" }}
{{- "// Cite information from the tool using the following format:\n" }}
{{- "// `【{cursor}†L{line_start}(-L{line_end})?】`, for example: `【6†L9-L11】` or `【8†L3】`.\n" }}
{{- "// Do not quote more than 10 words directly from the tool output.\n" }}
{{- "// sources=web (default: web)\n" }}
{{- "namespace browser {\n\n" }}
{{- "// Searches for information related to `query` and displays `topn` results.\n" }}
{{- "type search = (_: {\n" }}
{{- "query: string,\n" }}
{{- "topn?: number, // default: 10\n" }}
{{- "source?: string,\n" }}
{{- "}) => any;\n\n" }}
{{- "// Opens the link `id` from the page indicated by `cursor` starting at line number `loc`, showing `num_lines` lines.\n" }}
{{- "// Valid link ids are displayed with the formatting: `【{id}†.*】`.\n" }}
{{- "// If `cursor` is not provided, the most recent page is implied.\n" }}
{{- "// If `id` is a string, it is treated as a fully qualified URL associated with `source`.\n" }}
{{- "// If `loc` is not provided, the viewport will be positioned at the beginning of the document or centered on the most relevant passage, if available.\n" }}
{{- "// Use this function without `id` to scroll to a new location of an opened page.\n" }}
{{- "type open = (_: {\n" }}
{{- "id?: number | string, // default: -1\n" }}
{{- "cursor?: number, // default: -1\n" }}
{{- "loc?: number, // default: -1\n" }}
{{- "num_lines?: number, // default: -1\n" }}
{{- "view_source?: boolean, // default: false\n" }}
{{- "source?: string,\n" }}
{{- "}) => any;\n\n" }}
{{- "// Finds exact matches of `pattern` in the current page, or the page given by `cursor`.\n" }}
{{- "type find = (_: {\n" }}
{{- "pattern: string,\n" }}
{{- "cursor?: number, // default: -1\n" }}
{{- "}) => any;\n\n" }}
{{- "} // namespace browser\n\n" }}
{%- endif -%}
{%- if python_tool %}
{{- "## python\n\n" }}
{{- "Use this tool to execute Python code in your chain of thought. The code will not be shown to the user. This tool should be used for internal reasoning, but not for code that is intended to be visible to the user (e.g. when creating plots, tables, or files).\n\n" }}
{{- "When you send a message containing Python code to python, it will be executed in a stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 120.0 seconds. The drive at '/mnt/data' can be used to save and persist user files. Internet access for this session is UNKNOWN. Depends on the cluster.\n\n" }}
{%- endif -%}
{%- endmacro -%}
{#- System Message Construction ============================================ #}
{%- macro build_system_message() -%}
{%- if model_identity is not defined %}
{%- set model_identity = "You are ChatGPT, a large language model trained by OpenAI." %}
{%- endif %}
{{- model_identity + "\n" }}
{{- "Knowledge cutoff: 2024-06\n" }}
{{- "Current date: " + strftime_now("%Y-%m-%d") + "\n\n" }}
{%- if reasoning_effort is not defined %}
{%- set reasoning_effort = "medium" %}
{%- endif %}
{{- "Reasoning: " + reasoning_effort + "\n\n" }}
{%- if builtin_tools %}
{{- "# Tools\n\n" }}
{%- set available_builtin_tools = namespace(browser=false, python=false) %}
{%- for tool in builtin_tools %}
{%- if tool == "browser" %}
{%- set available_builtin_tools.browser = true %}
{%- elif tool == "python" %}
{%- set available_builtin_tools.python = true %}
{%- endif %}
{%- endfor %}
{{- render_builtin_tools(available_builtin_tools.browser, available_builtin_tools.python) }}
{%- endif -%}
{{- "# Valid channels: analysis, commentary, final. Channel must be included for every message." }}
{%- if tools -%}
{{- "\nCalls to these tools must go to the commentary channel: 'functions'." }}
{%- endif -%}
{%- endmacro -%}
{#- Main Template Logic ================================================= #}
{#- Set defaults #}
{#- Render system message #}
{{- "<|start|>system<|message|>" }}
{{- build_system_message() }}
{{- "<|end|>" }}
{#- Extract developer message #}
{%- if messages[0].role == "developer" or messages[0].role == "system" %}
{%- set developer_message = messages[0].content %}
{%- set loop_messages = messages[1:] %}
{%- else %}
{%- set developer_message = "" %}
{%- set loop_messages = messages %}
{%- endif %}
{#- Render developer message #}
{%- if developer_message or tools %}
{{- "<|start|>developer<|message|>" }}
{%- if developer_message %}
{{- "# Instructions\n\n" }}
{{- developer_message }}
{{- "\n\n" }}
{%- endif %}
{%- if tools -%}
{{- "# Tools\n\n" }}
{{- render_tool_namespace("functions", tools) }}
{%- endif -%}
{{- "<|end|>" }}
{%- endif %}
{#- Render messages #}
{%- set last_tool_call = namespace(name=none) %}
{%- for message in loop_messages -%}
{#- At this point only assistant/user/tool messages should remain #}
{%- if message.role == 'assistant' -%}
{#- Checks to ensure the messages are being passed in the format we expect #}
{%- if "content" in message %}
{%- if false %}
{{- raise_exception("You have passed a message containing <|channel|> tags in the content field. Instead of doing this, you should pass analysis messages (the string between '<|message|>' and '<|end|>') in the 'thinking' field, and final messages (the string between '<|message|>' and '<|end|>') in the 'content' field.") }}
{%- endif %}
{%- endif %}
{%- if "thinking" in message %}
{%- if "<|channel|>analysis<|message|>" in message.thinking or "<|channel|>final<|message|>" in message.thinking %}
{{- raise_exception("You have passed a message containing <|channel|> tags in the thinking field. Instead of doing this, you should pass analysis messages (the string between '<|message|>' and '<|end|>') in the 'thinking' field, and final messages (the string between '<|message|>' and '<|end|>') in the 'content' field.") }}
{%- endif %}
{%- endif %}
{%- if "tool_calls" in message %}
{#- We need very careful handling here - we want to drop the tool call analysis message if the model #}
{#- has output a later <|final|> message, but otherwise we want to retain it. This is the only case #}
{#- when we render CoT/analysis messages in inference. #}
{%- set future_final_message = namespace(found=false) %}
{%- for future_message in loop_messages[loop.index:] %}
{%- if future_message.role == 'assistant' and "tool_calls" not in future_message %}
{%- set future_final_message.found = true %}
{%- endif %}
{%- endfor %}
{#- We assume max 1 tool call per message, and so we infer the tool call name #}
{#- in "tool" messages from the most recent assistant tool call name #}
{%- set tool_call = message.tool_calls[0] %}
{%- if tool_call.function %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{%- if message.content and message.thinking %}
{{- raise_exception("Cannot pass both content and thinking in an assistant message with tool calls! Put the analysis message in one or the other, but not both.") }}
{%- elif message.content and not future_final_message.found %}
{{- "<|start|>assistant<|channel|>analysis<|message|>" + message.content + "<|end|>" }}
{%- elif message.thinking and not future_final_message.found %}
{{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
{%- endif %}
{{- "<|start|>assistant to=" }}
{{- "functions." + tool_call.name + "<|channel|>commentary " }}
{{- (tool_call.content_type if tool_call.content_type is defined else "json") + "<|message|>" }}
{{- tool_call.arguments|tojson }}
{{- "<|call|>" }}
{%- set last_tool_call.name = tool_call.name %}
{%- elif loop.last and not add_generation_prompt %}
{#- Only render the CoT if the final turn is an assistant turn and add_generation_prompt is false #}
{#- This is a situation that should only occur in training, never in inference. #}
{%- if "thinking" in message %}
{{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
{%- endif %}
{#- <|return|> indicates the end of generation, but <|end|> does not #}
{#- <|return|> should never be an input to the model, but we include it as the final token #}
{#- when training, so the model learns to emit it. #}
{{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|return|>" }}
{%- else %}
{#- CoT is dropped during all previous turns, so we never render it for inference #}
{{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|end|>" }}
{%- set last_tool_call.name = none %}
{%- endif %}
{%- elif message.role == 'tool' -%}
{%- if last_tool_call.name is none %}
{{- raise_exception("Message has tool role, but there was no previous assistant message with a tool call!") }}
{%- endif %}
{{- "<|start|>functions." + last_tool_call.name }}
{{- " to=assistant<|channel|>commentary<|message|>" + message.content|tojson + "<|end|>" }}
{%- elif message.role == 'user' -%}
{{- "<|start|>user<|message|>" + message.content + "<|end|>" }}
{%- endif -%}
{%- endfor -%}
{#- Generation prompt #}
{%- if add_generation_prompt -%}
<|start|>assistant
{%- endif -%}, example_format: '<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-22
Reasoning: high
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions
You are a helpful assistant
<|end|><|start|>user<|message|>Hello<|end|><|start|>assistant<|channel|>final<|message|>Hi there<|end|><|start|>user<|message|>How are you?<|end|><|start|>assistant'
main: server is listening on http://0.0.0.0:8003 - starting the main loop
srv update_slots: all slots are idle
srv log_server_r: request: GET /health 127.0.0.1 200
[INFO] <GPT-OSS-120B-GGML> Health check passed on http://localhost:8003/health
srv params_from_: Chat format: GPT-OSS
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 984
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 984, n_tokens = 984, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 984, n_tokens = 984
slot update_slots: id 0 | task 0 | SWA checkpoint create, pos_min = 0, pos_max = 983, size = 34.605 MiB, total = 1/3 (34.605 MiB)
slot release: id 0 | task 0 | stop processing: n_past = 13530, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 1187.99 ms / 984 tokens ( 1.21 ms per token, 828.29 tokens per second)
eval time = 251877.14 ms / 12547 tokens ( 20.07 ms per token, 49.81 tokens per second)
total time = 253065.12 ms / 13531 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
[INFO] Request 172.18.0.4 "POST /v1/chat/completions HTTP/1.1" 200 3138754 "Python/3.11 aiohttp/3.12.15" 4m39.339013547s
srv params_from_: Chat format: GPT-OSS
slot launch_slot_: id 0 | task 12548 | processing task
slot update_slots: id 0 | task 12548 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 13547
slot update_slots: id 0 | task 12548 | n_past = 985, cache_tokens.size() = 13530, seq_id = 0, pos_min = 11226, n_swa = 128
slot update_slots: id 0 | task 12548 | SWA checkpoint restore, pos_min = 0, pos_max = 983, size = 34.605 MiB
slot update_slots: id 0 | task 12548 | kv cache rm [983, end)
slot update_slots: id 0 | task 12548 | prompt processing progress, n_past = 3031, n_tokens = 2048, progress = 0.151177
slot update_slots: id 0 | task 12548 | kv cache rm [3031, end)
slot update_slots: id 0 | task 12548 | prompt processing progress, n_past = 5079, n_tokens = 2048, progress = 0.302355
slot update_slots: id 0 | task 12548 | kv cache rm [5079, end)
slot update_slots: id 0 | task 12548 | prompt processing progress, n_past = 7127, n_tokens = 2048, progress = 0.453532
slot update_slots: id 0 | task 12548 | kv cache rm [7127, end)
slot update_slots: id 0 | task 12548 | prompt processing progress, n_past = 9175, n_tokens = 2048, progress = 0.604710
slot update_slots: id 0 | task 12548 | kv cache rm [9175, end)
slot update_slots: id 0 | task 12548 | prompt processing progress, n_past = 11223, n_tokens = 2048, progress = 0.755887
slot update_slots: id 0 | task 12548 | kv cache rm [11223, end)
slot update_slots: id 0 | task 12548 | prompt processing progress, n_past = 13271, n_tokens = 2048, progress = 0.907064
slot update_slots: id 0 | task 12548 | kv cache rm [13271, end)
slot update_slots: id 0 | task 12548 | prompt processing progress, n_past = 13547, n_tokens = 276, progress = 0.927438
slot update_slots: id 0 | task 12548 | prompt done, n_past = 13547, n_tokens = 276
slot update_slots: id 0 | task 12548 | SWA checkpoint create, pos_min = 11243, pos_max = 13546, size = 81.027 MiB, total = 2/3 (115.632 MiB)
#stuck here
I am not able to reproduce. What command and client are you using?
Can you try adding -lv 1 to the server and see whether any logs are printed after it gets stuck?
command is ./llama-server -m /home/ultimis/LLM/Models/ggml-org/gpt-oss-120b-GGUF/gpt-oss-120b-mxfp4-00001-of-00003.gguf -c 131072 -ngl 999 -b 2048 -ub 2048 -fa --reasoning-format none --jinja --chat-template-kwargs '{"reasoning_effort":"high"}' --host 0.0.0.0 --port 8081 -lv 1
With -lv 1 the log is spitting out tons of "dissolution"; I reported this at #15516, so it looks related. The client is OpenWebUI.
Full terminal output: https://gist.github.com/AbdullahMPrograms/0c739e241260b1b66af8de04e0db21cd
Try again without --reasoning-format none. Your request has the Harmony tokens in the assistant message. In my experience this causes weird hallucinations from the model; it is imperative that they are stripped. --reasoning-format none leaves them in, which places the responsibility for removing them on the client.
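For a client that does keep --reasoning-format none, the cleanup could look roughly like this sketch (the regexes and function name are mine, inferred from the Harmony tokens visible in the chat template in the log above, not an API that llama.cpp provides):

```python
import re

# Chain-of-thought spans are dropped entirely; the patterns are inferred from
# the Harmony template in the server log, not from any official spec.
_ANALYSIS = re.compile(
    r"<\|start\|>assistant<\|channel\|>analysis<\|message\|>.*?<\|end\|>", re.S)
_TOKENS = re.compile(r"<\|(?:start|end|message|channel|return|call)\|>")

def strip_harmony(text: str) -> str:
    """Remove Harmony control tokens from an assistant turn before it is
    echoed back to the server as conversation history."""
    text = _ANALYSIS.sub("", text)                                   # drop CoT
    text = text.replace("assistant<|channel|>final<|message|>", "")  # final-channel header
    return _TOKENS.sub("", text).strip()

raw = ("<|start|>assistant<|channel|>analysis<|message|>thinking...<|end|>"
       "<|start|>assistant<|channel|>final<|message|>Hi there<|return|>")
print(strip_harmony(raw))  # Hi there
```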
Same issue after removing --reasoning-format none; the command is now: ./llama-server -m /home/ultimis/LLM/Models/ggml-org/gpt-oss-120b-GGUF/gpt-oss-120b-mxfp4-00001-of-00003.gguf -c 131072 -ngl 999 -b 2048 -ub 2048 -fa --jinja --chat-template-kwargs '{"reasoning_effort":"high"}' --host 0.0.0.0 --port 8081 -lv 1
Started a fresh chat with the same prompt, asked a follow-up question, and got the same "dissolution" error: https://gist.github.com/AbdullahMPrograms/10b20002899de6ea28ac9064bd34cd14
Could be an issue with ROCm backend (cc @IMbackK)
@AbdullahMPrograms Could you try running with the Vulkan backend?
Exact same issue on Vulkan: https://gist.github.com/AbdullahMPrograms/15e1ba6a43c26974e97f7a1b897bab2f
Hm, not sure. I've been doing a lot of usage with the Metal backend and haven't observed issues with large prompts with gpt-oss-120b. I know that the CUDA backend also works correctly for the same cases.
Observing this issue with both ROCm and Vulkan is quite strange. I don't have a good explanation.
Both the ROCm backend on gfx11 and the Vulkan backend accumulate at fp16 in mul_mat; I guess this could be an overflow? I also can't reproduce this on CDNA, but that accumulates at fp32.
You can check for this by compiling with GGML_CUDA_FORCE_MMQ=On (which will be slow instead).
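For intuition on the overflow hypothesis: IEEE binary16 tops out at 65504, so a long dot product whose partial sums pass that point saturates to inf, while an fp32 accumulator is unaffected. A minimal sketch, emulating an fp16 accumulator in pure Python via struct's half-precision format (the operand values are illustrative, not taken from the model):

```python
import math
import struct

def fp16(x: float) -> float:
    """Round x to the nearest IEEE binary16 value, overflowing to inf
    (emulates an fp16 accumulator register)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.inf

acc16 = 0.0
acc32 = 0.0
for _ in range(4096):            # one dot product of length 4096
    acc16 = fp16(acc16 + 32.0)   # partial sums pass 65504 -> inf
    acc32 += 32.0                # fp32 accumulation stays exact here

print(acc16)  # inf
print(acc32)  # 131072.0
```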
RDNA3 can do high-precision accumulation too, via V_WMMA_F32_16X16X16_F16, with no performance downside, but AMD has not implemented the use of this instruction in rocBLAS. They have added kernels to hipBLASLt that do use it, but hipBLAS only dispatches to those on RDNA4, for no technical reason.
I recommend repeatedly complaining loudly to AMD about this pointless deficiency, and about the incredibly poor state of AMD's BLAS libraries in general (it's not market segmentation; they are bad on CDNA too).
Given these missing kernels in hipBLASLt, does that mean this is not fixable? I've begun to notice this stalled-generation issue more and more while using GPT-OSS.
@AbdullahMPrograms Regarding the Vulkan backend, I think that folks have narrowed down the issue here: https://github.com/ggml-org/llama.cpp/issues/15274#issuecomment-3225703560. Can you try running with F16 disabled (as described there) and confirm that this also fixes the issue for you?
@ggerganov this fixes the issue for vulkan! Vulkan is still not as performant for text generation as ROCm but at least it works!
GGML_CUDA_FORCE_MMQ should work for the ROCm backend too.
> Given these missing kernels in hipBLASLt, does that mean this is not fixable? I've begun to notice this stalled-generation issue more and more while using GPT-OSS.
It's not directly fixable from our side; the options are:
- AMD could easily fix this issue by cross-dispatching to hipBLASLt like they do on RDNA4.
- We could avoid AMD's BLAS mess altogether by implementing WMMA in MMQ instead, but I don't have any WMMA-capable ROCm device to do development on, so for now I don't think anyone is working on this.
edit: I guess a model-specific fix would be to figure out where the higher precision is necessary for this model and accumulate at 32 bits for only those operations, but honestly that is kind of dumb, since unlike some NVIDIA devices, all AMD devices are perfectly capable of always accumulating at high precision with good performance.
Hit this same issue myself on gfx1151 hardware. Recompiling with GGML_CUDA_FORCE_MMQ=ON fixed it for me.
Recompiling with -DGGML_CUDA_FORCE_MMQ=ON has solved the issue for me as well; I have not done any speed testing yet, but performance seems comparable.
For gfx11 GPUs the performance impact of GGML_CUDA_FORCE_MMQ is huge, but only for prompt processing/prefill, as MMQ is used for token generation anyhow.
Currently gfx11 is especially problematic, as the BLAS path is slow for fp32 accumulation and the MMQ path is slow in general. For gfx10 and below the MMQ path is fast, and for gfx12 the BLAS path does fp32 accumulation.
I'm hitting the same problem on AMD, 2× W7800 48 GB (ROCm).
I can reproduce the context-roll stall described in this issue on a dual-GPU AMD system.
Environment:
- GPUs: 2× Radeon Pro W7800 (48 GB each, RDNA3)
- Backend: ROCm (HIP), ROCm 7.0.1
- llama.cpp: llama-server built with HIP/hipBLAS (recent master)
- Model: GPT-OSS 120B (GGUF, MXFP4, multi-shard)
- Kernel cmdline: iommu=pt pci=realloc=off pci=bfsort amdgpu.audio=0 amdgpu.runpm=1 amdgpu.aspm=1
What happens:
1. A long response (tens of thousands of tokens of prior history) generates quickly and looks normal.
2. On the very next turn, right when the context should slide/roll, generation stalls: no new tokens are emitted.
3. During the stall, both GPUs oscillate repeatedly between 0% and 100% load. It never recovers; it only clears after unloading/reloading the model (or restarting the server).
4. The model is not evicted from VRAM; it simply stops producing tokens at the roll step.
This matches the SWA‑related behavior described in #15517 (stall around SWA checkpoint create/restore).
What I've tried (no change):
- Reducing -c (e.g., to 8192), varying -ub (512/1024), changing --keep (128–512).
- Enabling --kv-unified and --defrag-thold 0.10.
- Disabling context shift (--no-context-shift) does avoid the stall, but then there is no rolling, which is not acceptable for my use case.
This issue was closed because it has been inactive for 14 days since being marked as stale.
This issue is very likely related to the fact that we do fp16 MMA accumulation on some devices: gfx11 and Volta.
I don't have a gfx11 card myself, but AFAIK the performance impact of using fp32 accumulation should not be so bad on gfx11 from a hardware perspective; rocBLAS and hipBLASLt have, however, provided poor kernels for this case in the past.
It would be helpful if someone could use rocblas-bench to figure out whether this is still the case.
We can also try to figure out which tensors cause the accumulator overflow in this case and raise the precision for just those; this has been done before to counter this issue, but it's playing endless whack-a-mole.