Prepend `<think>` tag to stream for thinking models like GLM-4.7
## Motivation

For thinking models like GLM-4.7, the `<think>` tag is inserted into the prompt (input) by the tokenizer's `apply_chat_template()`. The model generates tokens starting after this tag, so `<think>` never appears in the streamed output. The frontend expects `<think>...</think>` tags to extract and display thinking content.
Log evidence:

```
[gMASK]<sop><|system|>...<|user|>...<|assistant|><think>
```

The prompt ends with `<think>`, so the model generates content after it, never returning the opening tag.
## Changes

- Added `detect_thinking_prompt_suffix()` helper function in `utils_mlx.py` to detect if a prompt ends with the `<think>` tag
- Added `parse_thinking_models()` generator wrapper in `runner.py` that prepends the thinking tag to the output stream
- Modified the main generation loop to use the thinking wrapper for non-`GptOssModel` models when a thinking prefix is detected
- Updated test mocks to handle the new `apply_chat_template` call
## Why It Works

The solution follows the same pattern as `parse_gpt_oss()`: a generator wrapper that transforms the output stream. When the chat template ends with `<think>`, we prepend this tag to the first generated token so the frontend receives the complete `<think>...</think>` structure it expects.
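As a minimal illustration of that wrapper pattern (the name and signature are assumed from the description above, not copied from the actual `runner.py`):

```python
from typing import Iterator


def parse_thinking_models(stream: Iterator[str], prefix: str = "<think>") -> Iterator[str]:
    """Prepend the thinking tag that the chat template consumed from the
    prompt, so consumers see the full <think>...</think> structure.

    Sketch only: the real generator in runner.py likely operates on
    richer token/event objects rather than plain strings.
    """
    first = True
    for token in stream:
        if first:
            # Attach the tag to the first token instead of yielding it
            # alone, so stream chunk counts are unchanged.
            yield prefix + token
            first = False
        else:
            yield token
```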
## Test Plan

### Manual Testing

- Run exo: `uv run exo`
- Send a chat request to GLM-4.7:

  ```
  curl http://localhost:52415/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "mlx-community/GLM-4.7-8bit-gs32", "messages": [{"role": "user", "content": "What is 2+2?"}], "stream": true }'
  ```

- Verify the streamed response starts with the `<think>` tag
- Verify the frontend dashboard correctly shows the thinking section collapsed

### Automated Testing

- All 72 worker tests pass: `uv run pytest src/exo/worker/`
- Type checker passes: `uv run basedpyright`
- Linter passes: `uv run ruff check`
🤖 Generated with Claude Code
Tested manually: GLM now correctly renders the thinking block. Previously, no thinking block appeared.
## Addressed Reviewer Comments

This commit addresses both reviewer concerns:

1. Duplicate `apply_chat_template` call removed

Previously, `apply_chat_template` was called twice:

- Once inside `mlx_generate()` to build the prompt for generation
- Once in `runner.py` to detect the thinking tag suffix

Now it's called once in `runner.py` and the prompt is passed to `mlx_generate()` as a parameter. This eliminates the inefficient duplication.
2. Test mock comment clarified

Updated the comment on the `apply_chat_template` mock in `test_event_ordering.py` to explain:

- Why it's needed: the test uses a fake tokenizer (integer `1`)
- What it does: returns a prompt without a thinking tag so `detect_thinking_prompt_suffix` returns `None`
## Changes

| File | Change |
|---|---|
| `generate.py` | Added `prompt: str` parameter, removed internal `apply_chat_template` call |
| `runner.py` | Call `apply_chat_template` once before `mlx_generate`, pass prompt, reuse for thinking detection |
| `test_event_ordering.py` | Updated mock comment to explain its purpose |
All checks pass: `basedpyright`, `ruff check`, `pytest` (151 tests), `nix fmt`