
Proposal: Add thinking Case to Generation Enum

Open ronaldmannak opened this issue 5 months ago • 4 comments

Currently, the Generation enum has three cases: chunk, info, and toolCall.

Many newer APIs (such as Ollama’s thinking property in Message) now expose "thinking" as a dedicated field in their response data structures, rather than encoding thinking tokens inline in the response text.

Rationale

Different models may use varying tokens to represent "thinking," making it complicated to detect or filter these tokens at the application layer. Moving the responsibility for handling these special tokens to the inference engine would simplify integration and keep application code cleaner.

Proposed Change

Add a new .thinking case to the Generation enum:

public enum Generation: Sendable {
    /// A generated token represented as a String.
    case chunk(String)

    /// A generated "thinking" token, represented as a String.
    case thinking(String)

    /// Completion information summarizing token counts and performance metrics.
    case info(GenerateCompletionInfo)

    /// A tool call from the language model.
    case toolCall(ToolCall)
    ...
}

Considerations

  • Breaking Change: Adding a new enum case will require updates to any exhaustive switch statements that handle Generation, both in the mlx-swift-examples code and in third-party apps using MLX-Swift (see the sketch below).
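
For illustration, a minimal sketch of what an updated consumer switch might look like. The stream variable is assumed to be an AsyncStream of Generation values like the one the existing generate API yields, and handleToolCall is a hypothetical app-level helper; this is not a definitive implementation.

// Minimal sketch of a consumer switch updated for the new case.
// `stream` is assumed to be an AsyncStream<Generation>; `handleToolCall`
// is a hypothetical app-level helper, not part of the library.
var thinkingText = ""
var responseText = ""

for await generation in stream {
    switch generation {
    case .chunk(let text):
        responseText += text
    case .thinking(let text):
        thinkingText += text      // show or hide separately from the answer
    case .toolCall(let call):
        handleToolCall(call)
    case .info(let info):
        print(info)               // completion summary / performance metrics
    }
}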

Looking for feedback!

ronaldmannak · Jul 06 '25

That might go nicely with #310.

How would this work in practice? Let's say you have:

<think>a b c d</think>
response1 response2

Would you get:

.think(a)
.think(b)
.think(c)
.think(d)
.chunk(response1)
.chunk(response2)

Or would it collect all a b c d into a single unit (this might harm the streaming view of it)?

I think this would also need to deal with the case where the prompt embeds a <think> token to prime it.

Does this require support from the tokenizer to identify the start/stop think tokens?
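
For concreteness, here is one possible shape for that detection logic, purely as a sketch: it assumes the tokenizer (or chat template) can report the literal start/stop markers and whether the prompt already primed a thinking section, and that the markers arrive as whole decoded segments. None of the names below come from the existing API.

// Purely a sketch (no such type exists today): split decoded text into
// thinking vs. response events, given the literal markers and a flag that
// says whether the prompt already opened a thinking section.
struct ThinkingSplitter {
    let startTag: String          // e.g. "<think>", ideally from the tokenizer
    let endTag: String            // e.g. "</think>"
    var insideThinking: Bool      // true if the prompt ended with the start tag

    // Assumes markers arrive as whole decoded segments; a real implementation
    // would need buffering for markers split across tokens.
    mutating func classify(_ segment: String) -> Generation? {
        if segment == startTag { insideThinking = true; return nil }
        if segment == endTag { insideThinking = false; return nil }
        return insideThinking ? .thinking(segment) : .chunk(segment)
    }
}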

davidkoski · Jul 17 '25

There still isn't much standardization around thinking / chain-of-thought tokens, how tokenizers represent them, or how models respond with them. But I do think it's worth making it easier for downstream applications if possible. My 2c on some of the points @davidkoski raised:

Or would it collect all a b c d into a single unit (this might harm the streaming view of it)?

It would be good to not break streaming especially since thinking can be many tokens.

I think this would also need to deal with the case where the prompt embeds a <think> token to prime it.

Good point. That makes it a bit trickier, as you could already be in the thinking section before generating any tokens.

Does this require support from the tokenizer to identify the start/stop think tokens?

I think so, yes. Ideally this would be a property on the original tokenizer, but it isn't yet. I started something similar in mlx-lm.
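
On the Swift side that might eventually look like a property on the tokenizer, something along these lines. The protocol and property names are purely hypothetical, not part of swift-transformers or mlx-lm.

// Purely hypothetical: how a tokenizer could advertise its thinking markers.
protocol ThinkingTokenProviding {
    /// Literal text that opens a thinking section, e.g. "<think>",
    /// or nil if the model has no notion of thinking.
    var thinkingStartTag: String? { get }
    /// Literal text that closes a thinking section, e.g. "</think>".
    var thinkingEndTag: String? { get }
}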

awni · Jul 18 '25

That might go nicely with #310.

I haven't looked at #310 yet, and I can't tell how close it is to being finished and accepted. Is there anything in that PR that this enum and the port of @awni's thinking property might depend on?

How would this work in practice? Let's say you have:

<think>a b c d</think>
response1 response2

Would you get:

.think(a)
.think(b)
.think(c)
.think(d)
.chunk(response1)
.chunk(response2)

Or would it collect all a b c d into a single unit (this might harm the streaming view of it)?

I believe you want to stream it and have the client decide whether to show the stream or not. I think most clients show the thinking process stream to decrease perceived latency.

I think this would also need to deal with the case where the prompt embeds a <think> token to prime it.

That's an excellent point, and I hadn't thought of it. This makes streaming tricky. I'm honestly not sure how clients handle this case now (maybe by retroactively classifying received tokens as reasoning upon receiving a closing </think> tag?). Since @awni's think_start property in mlx-lm is in the tokenizer (utilities), it wouldn't be triggered in this case either, I assume. Is that correct? I can think of two solutions:

  • Don't handle this edge case.
  • Always explicitly emit either an "invisible" or a visible thinking-ended tag (e.g. "</think>") so the client can handle the already received tokens in the appropriate way (rough sketch below). This feels like a hack, though, since it's basically making what we already have more complex. Are there any other ways?
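
A rough sketch of that second option, modeled with a dedicated marker case rather than an empty string. The enum and function names below are made up for illustration only.

// Hypothetical event shape: the engine emits .thinkingEnded when it sees the
// closing </think> tag, even if the opening <think> tag lived in the prompt,
// so the client can retroactively reclassify what it already received.
enum ThinkingAwareGeneration {
    case chunk(String)
    case thinking(String)
    case thinkingEnded
}

func apply(_ event: ThinkingAwareGeneration,
           thinking: inout String, response: inout String) {
    switch event {
    case .chunk(let text): response += text
    case .thinking(let text): thinking += text
    case .thinkingEnded:
        // Everything received so far was actually reasoning: move it over.
        thinking += response
        response = ""
    }
}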

ronaldmannak · Jul 19 '25

One option might be to add a .prompt(token) and stream the incoming prompt as well -- then you will see the <think> tag and can handle it. Of course you probably don't want to stream a huge prompt and might need additional cases for roles and image markers (and their contents -- those aren't really tokens in there).
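
Concretely, that might look something like the following. This is just a sketch; the case name and whether the prompt is echoed per token or per segment are open questions.

// Sketch only: echo prompt segments back through the stream so a client can
// see the priming <think> tag (and, potentially, role or image markers).
public enum Generation: Sendable {
    /// A segment of the incoming prompt, echoed back to the client.
    case prompt(String)

    case chunk(String)
    case thinking(String)
    case info(GenerateCompletionInfo)
    case toolCall(ToolCall)
}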

davidkoski · Jul 22 '25