
Prepend <think> tag to stream for thinking models like GLM-4.7

AlexCheema opened this issue 1 month ago · 2 comments

Motivation

For thinking models like GLM-4.7, the tokenizer's apply_chat_template() inserts the <think> tag into the prompt (input). The model generates tokens starting after this tag, so <think> never appears in the streamed output. The frontend, however, expects <think>...</think> tags in the output in order to extract and display the thinking content.

Log evidence:

[gMASK]<sop><|system|>...<|user|>...<|assistant|><think>

The prompt ends with <think>, so the model generates content after it, never returning the opening tag.

Changes

  • Added detect_thinking_prompt_suffix() helper function in utils_mlx.py to detect whether a prompt ends with the <think> tag
  • Added parse_thinking_models() generator wrapper in runner.py that prepends the thinking tag to the output stream
  • Modified the main generation loop to use the thinking wrapper for non-GptOssModel models when a thinking prefix is detected
  • Updated test mocks to handle the new apply_chat_template call

Why It Works

The solution follows the same pattern as parse_gpt_oss(): a generator wrapper that transforms the output stream. When the chat template ends with <think>, the wrapper prepends that tag to the first generated token, so the frontend receives the complete <think>...</think> structure it expects.
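A sketch of that wrapper pattern (illustrative; the real parse_thinking_models() in runner.py may differ in its details):

```python
from typing import Iterator

def parse_thinking_models(stream: Iterator[str], prefix: str = "<think>") -> Iterator[str]:
    """Prepend `prefix` to the first chunk of the stream; pass the rest through unchanged."""
    first = True
    for chunk in stream:
        yield (prefix + chunk) if first else chunk
        first = False
```

Because it only touches the first chunk, the wrapper adds no buffering and preserves streaming latency for the rest of the response.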

Test Plan

Manual Testing

  • Run exo: uv run exo
  • Send a chat request to GLM-4.7:
    curl http://localhost:52415/v1/chat/completions -H "Content-Type: application/json" -d '{
      "model": "mlx-community/GLM-4.7-8bit-gs32",
      "messages": [{"role": "user", "content": "What is 2+2?"}],
      "stream": true
    }'
    
  • Verify the streamed response starts with the <think> tag
  • Verify the frontend dashboard correctly shows the thinking section collapsed

Automated Testing

  • All 72 worker tests pass: uv run pytest src/exo/worker/
  • Type checker passes: uv run basedpyright
  • Linter passes: uv run ruff check

🤖 Generated with Claude Code

AlexCheema · Jan 17 '26 22:01

Tested manually: GLM now renders the thinking block correctly. Previously, no thinking block appeared.

[Screenshot, 2026-01-17 11:26 PM: GLM response rendered with the thinking block]

AlexCheema · Jan 17 '26 23:01

Addressed Reviewer Comments

This commit addresses both reviewer concerns:

1. Duplicate apply_chat_template call removed

Previously, apply_chat_template was called twice:

  • Once inside mlx_generate() to build the prompt for generation
  • Once in runner.py to detect thinking tag suffix

Now it is called once in runner.py, and the resulting prompt is passed to mlx_generate() as a parameter, eliminating the duplicated work.
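In outline, the refactored flow might look like the following sketch (the names mirror the PR, but the wiring and signatures here are assumptions, not the actual code):

```python
def prepare_prompt(messages, apply_chat_template):
    """Render the chat prompt exactly once; reuse the result for both
    thinking-tag detection and generation, instead of re-rendering
    inside mlx_generate()."""
    prompt = apply_chat_template(messages)
    thinking_prefix = "<think>" if prompt.endswith("<think>") else None
    return prompt, thinking_prefix

# Usage with a stand-in template function that mimics GLM-4.7's suffix:
prompt, prefix = prepare_prompt(
    [{"role": "user", "content": "What is 2+2?"}],
    lambda msgs: "<|user|>What is 2+2?<|assistant|><think>",
)
# `prompt` would then be handed to mlx_generate() as a parameter, and
# `prefix` decides whether the output stream gets wrapped.
```

The key property is that apply_chat_template runs a single time per request, with both consumers reading the same rendered string.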

2. Test mock comment clarified

Updated the comment on the apply_chat_template mock in test_event_ordering.py to explain:

  • Why it's needed: the test uses a fake tokenizer (integer 1)
  • What it does: returns a prompt without thinking tag so detect_thinking_prompt_suffix returns None
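As a hedged illustration of why that mock is needed (the mock shape below is an assumption, not the test's exact code): the fake tokenizer cannot render a template, so the call must be stubbed.

```python
from unittest.mock import MagicMock

# The test's "tokenizer" is just the integer 1, so apply_chat_template is
# replaced by a mock returning a prompt *without* a trailing <think>,
# which makes the thinking detection take the plain (non-thinking) path.
apply_chat_template = MagicMock(return_value="<|user|>hi<|assistant|>")

prompt = apply_chat_template(messages=[{"role": "user", "content": "hi"}])
```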

Changes

| File | Change |
| --- | --- |
| generate.py | Added prompt: str parameter; removed internal apply_chat_template call |
| runner.py | Calls apply_chat_template once before mlx_generate, passes the prompt, and reuses it for thinking detection |
| test_event_ordering.py | Updated the mock comment to explain its purpose |

All checks pass: basedpyright, ruff check, pytest (151 tests), nix fmt

AlexCheema · Jan 19 '26 01:01