Prepend `<think>` tag to stream for thinking models like GLM-4.7
## Motivation

For thinking models like GLM-4.7, the `<think>` tag is inserted into the prompt (input) by the tokenizer's `apply_chat_template()`. The model generates tokens starting after this tag, so `<think>` never appears in the streamed output. The frontend expects `<think>...</think>` tags to extract and display thinking content.
Log evidence:

```
[gMASK]<sop><|system|>...<|user|>...<|assistant|><think>
```

The prompt ends with `<think>`, so the model generates content after it, never returning the opening tag.
## Changes

- Added `detect_thinking_prompt_suffix()` helper function in `utils_mlx.py` to detect if a prompt ends with the `<think>` tag
- Added `parse_thinking_models()` generator wrapper in `runner.py` that prepends the thinking tag to the output stream
- Modified the main generation loop to use the thinking wrapper for non-`GptOssModel` models when a thinking prefix is detected
- Updated test mocks to handle the new `apply_chat_template` call
## Why It Works

The solution follows the same pattern as `parse_gpt_oss()`: a generator wrapper that transforms the output stream. When the chat template ends with `<think>`, we prepend this tag to the first generated token so the frontend receives the complete `<think>...</think>` structure it expects.
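As a minimal illustration of that wrapper pattern (the name and signature are assumed from the description above, not copied from the actual `runner.py`):

```python
from typing import Iterator


def parse_thinking_models(stream: Iterator[str], prefix: str = "<think>") -> Iterator[str]:
    """Prepend the thinking tag that the chat template consumed from the
    prompt, so consumers see the full <think>...</think> structure.

    Sketch only: the real generator in runner.py likely operates on
    richer token/event objects rather than plain strings.
    """
    first = True
    for token in stream:
        if first:
            # Attach the tag to the first token instead of yielding it
            # alone, so stream chunk counts are unchanged.
            yield prefix + token
            first = False
        else:
            yield token
```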
## Test Plan

### Manual Testing

- Run exo: `uv run exo`
- Send a chat request to GLM-4.7:

  ```
  curl http://localhost:52415/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "mlx-community/GLM-4.7-8bit-gs32", "messages": [{"role": "user", "content": "What is 2+2?"}], "stream": true }'
  ```

- Verify the streamed response starts with the `<think>` tag
- Verify the frontend dashboard correctly shows the thinking section collapsed

### Automated Testing

- All 72 worker tests pass: `uv run pytest src/exo/worker/`
- Type checker passes: `uv run basedpyright`
- Linter passes: `uv run ruff check`
🤖 Generated with Claude Code
Tested manually: GLM now correctly renders the thinking block. Previously, no thinking block appeared.
## Addressed Reviewer Comments

This commit addresses both reviewer concerns:

1. Duplicate `apply_chat_template` call removed

Previously, `apply_chat_template` was called twice:

- Once inside `mlx_generate()` to build the prompt for generation
- Once in `runner.py` to detect the thinking tag suffix

Now it's called once in `runner.py` and the prompt is passed to `mlx_generate()` as a parameter. This eliminates the inefficient duplication.
2. Test mock comment clarified

Updated the comment on the `apply_chat_template` mock in `test_event_ordering.py` to explain:

- Why it's needed: the test uses a fake tokenizer (integer `1`)
- What it does: returns a prompt without a thinking tag so `detect_thinking_prompt_suffix` returns `None`
## Changes

| File | Change |
|---|---|
| `generate.py` | Added `prompt: str` parameter, removed internal `apply_chat_template` call |
| `runner.py` | Call `apply_chat_template` once before `mlx_generate`, pass prompt, reuse for thinking detection |
| `test_event_ordering.py` | Updated mock comment to explain its purpose |
All checks pass: `basedpyright`, `ruff check`, `pytest` (151 tests), `nix fmt`