
Upgrade mlx-lm to 0.30.2 with transformers 5.x compatibility

AlexCheema opened this issue 2 months ago

Motivation

Upgrade mlx-lm to version 0.30.2, which requires transformers 5.0.0rc2 as a prerelease dependency. This enables support for newer models like Kimi K2 Thinking while maintaining compatibility with existing models.

The transformers 5.x release includes breaking changes that affect custom tokenizers like Kimi's TikTokenTokenizer, requiring compatibility fixes.

Changes

Core Changes

  • mlx-lm upgrade: Bump to 0.30.2, pinning exact versions of mlx and mlx-lm to prevent breaking changes
  • transformers 5.x compatibility: Enable prerelease transformers dependency
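The pinned dependencies might look like the following in pyproject.toml (a sketch: only the mlx-lm and transformers versions come from this PR; the [tool.uv] prerelease setting is an assumption about how the prerelease is allowed through resolution):

```toml
[project]
dependencies = [
  "mlx-lm==0.30.2",          # exact pin, per this PR
  "transformers==5.0.0rc2",  # prerelease required by mlx-lm 0.30.2
]

[tool.uv]
prerelease = "allow"  # let the resolver accept the transformers release candidate
```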

Kimi K2 Tokenizer Fixes

  • Add bytes_to_unicode monkey-patch to restore function moved in transformers 5.0.0rc2
  • Load TikTokenTokenizer directly instead of via AutoTokenizer to bypass transformers 5.x bug with auto_map fallback
  • Patch encode() to use tiktoken directly with allowed_special="all" to handle special tokens from chat templates

Other Changes

  • Dashboard: Show disk usage for completed model downloads
  • CI: Add workflow_dispatch trigger to build-app workflow
  • Docs: Add basic API documentation

Testing

  • Add comprehensive tokenizer unit tests for all supported models
  • Tests verify encode/decode, special token handling, and chat template encoding

Why It Works

bytes_to_unicode issue: transformers 5.0.0rc2 moved bytes_to_unicode from transformers.models.gpt2.tokenization_gpt2 to transformers.convert_slow_tokenizer. Kimi's tokenization_kimi.py imports from the old location. The monkey-patch restores it at module load time.
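A minimal version of that monkey-patch might look like this (the two module paths are the ones named above; the exact placement and guard in exo are assumptions):

```python
# Compatibility shim: re-expose bytes_to_unicode at its pre-5.x location so
# that Kimi's tokenization_kimi.py can still import it. This must run before
# the Kimi tokenizer module is loaded.
try:
    from transformers.models.gpt2 import tokenization_gpt2

    if not hasattr(tokenization_gpt2, "bytes_to_unicode"):
        # transformers >= 5.0.0rc2 moved the helper here:
        from transformers.convert_slow_tokenizer import bytes_to_unicode

        tokenization_gpt2.bytes_to_unicode = bytes_to_unicode
except ImportError:
    pass  # transformers not installed or layout changed again: nothing to patch
```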

AutoTokenizer issue: transformers 5.x has a bug where tokenizer_class_from_name('TikTokenTokenizer') returns None for custom tokenizers with auto_map. Loading the tokenizer directly bypasses this.
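The direct load can be done by importing the class straight from the checkpoint's tokenization_kimi.py instead of going through AutoTokenizer. A sketch (the helper name is hypothetical; TikTokenTokenizer and tokenization_kimi come from the PR description; a stub file stands in for the real tokenizer module here):

```python
import importlib.util
import pathlib
import tempfile

def load_kimi_tokenizer_class(model_dir: str):
    """Import TikTokenTokenizer from <model_dir>/tokenization_kimi.py,
    bypassing AutoTokenizer's broken tokenizer_class_from_name lookup."""
    path = pathlib.Path(model_dir) / "tokenization_kimi.py"
    spec = importlib.util.spec_from_file_location("tokenization_kimi", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.TikTokenTokenizer

# Minimal demonstration with a stub file standing in for the real module:
with tempfile.TemporaryDirectory() as d:
    (pathlib.Path(d) / "tokenization_kimi.py").write_text(
        "class TikTokenTokenizer:\n    pass\n"
    )
    cls = load_kimi_tokenizer_class(d)
```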

encode() issue: transformers 5.x's pad() method fails for slow tokenizers. Using tiktoken's encode directly with allowed_special="all" avoids this path and properly handles special tokens like <|im_user|> from chat templates.
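The patched encode() could be as simple as the following sketch (assumption: the underlying tiktoken Encoding is reachable as `self.model`; the real attribute name in tokenization_kimi may differ). A stub stands in for tiktoken's Encoding to show why allowed_special="all" matters:

```python
def patched_encode(self, text, **kwargs):
    # Route straight to tiktoken, skipping transformers 5.x's failing
    # slow-tokenizer pad() path; allowed_special="all" lets chat-template
    # control tokens encode to single ids instead of raising.
    return self.model.encode(text, allowed_special="all")

class StubEncoding:
    """Stand-in for a tiktoken Encoding, for illustration only."""
    SPECIALS = {"<|im_user|>": 163584}  # hypothetical token id

    def encode(self, text, allowed_special=()):
        # Real tiktoken raises if a special token appears but is not allowed.
        if allowed_special != "all" and any(s in text for s in self.SPECIALS):
            raise ValueError("disallowed special token")
        return [self.SPECIALS.get(text, 0)]

class StubTokenizer:
    model = StubEncoding()

StubTokenizer.encode = patched_encode
ids = StubTokenizer().encode("<|im_user|>")
```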

Test Plan

Manual Testing

  • Hardware: 2x Mac Studios connected via Thunderbolt 5 (mike22 and james21-1)
  • Tested Kimi K2 Thinking model with pipeline parallelism across both nodes
  • Verified warmup inference completes successfully
  • Verified chat completions work with special tokens

Automated Testing

  • Added test_tokenizers.py with 31 tests covering:
    • Basic encode/decode for all model families (deepseek, kimi, llama, qwen, gpt-oss, glm)
    • Special token encoding (critical for chat templates)
    • Chat template application and encoding
    • Kimi-specific and GLM-specific edge cases
  • All tests pass: uv run pytest src/exo/worker/tests/unittests/test_mlx/test_tokenizers.py

🤖 Generated with Claude Code

AlexCheema, Jan 11 '26 16:01