Upgrade mlx-lm to 0.30.2 with transformers 5.x compatibility
## Motivation
Upgrade mlx-lm to version 0.30.2, which requires transformers 5.0.0rc2 as a prerelease dependency. This enables support for newer models such as Kimi K2 Thinking while maintaining compatibility with existing models.
The transformers 5.x release includes breaking changes that affect custom tokenizers like Kimi's TikTokenTokenizer, requiring compatibility fixes.
## Changes

### Core Changes
- mlx-lm upgrade: Bump to 0.30.2 with locked exact versions for mlx/mlx-lm to prevent breaking changes
- transformers 5.x compatibility: Enable prerelease transformers dependency
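A hypothetical dependency-pin fragment illustrating the locking described above (a sketch: only `mlx-lm==0.30.2` and `transformers==5.0.0rc2` are stated by this PR; the file layout shown is the generic `pyproject.toml` form, not necessarily this repo's):

```toml
[project]
dependencies = [
    "mlx-lm==0.30.2",          # locked exact version to prevent breaking changes
    "transformers==5.0.0rc2",  # prerelease pin enabling transformers 5.x
]
```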
### Kimi K2 Tokenizer Fixes
- Add `bytes_to_unicode` monkey-patch to restore function moved in transformers 5.0.0rc2
- Load `TikTokenTokenizer` directly instead of via `AutoTokenizer` to bypass transformers 5.x bug with `auto_map` fallback
- Patch `encode()` to use tiktoken directly with `allowed_special="all"` to handle special tokens from chat templates
### Other Changes
- Dashboard: Show disk usage for completed model downloads
- CI: Add `workflow_dispatch` trigger to build-app workflow
- Docs: Add basic API documentation
## Testing
- Add comprehensive tokenizer unit tests for all supported models
- Tests verify encode/decode, special token handling, and chat template encoding
## Why It Works
**`bytes_to_unicode` issue:** transformers 5.0.0rc2 moved `bytes_to_unicode` from `transformers.models.gpt2.tokenization_gpt2` to `transformers.convert_slow_tokenizer`. Kimi's `tokenization_kimi.py` imports it from the old location. The monkey-patch restores it there at module load time.
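The restore-by-monkey-patch pattern can be sketched with a stand-in module so the example is self-contained (the real fix targets `transformers.models.gpt2.tokenization_gpt2`; the module name `legacy_tokenization_gpt2` below is illustrative):

```python
import sys
import types

def bytes_to_unicode():
    """GPT-2-style byte-to-unicode table: maps all 256 byte values to
    printable unicode characters (a stand-in for the moved function)."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:          # non-printable bytes get fresh code points
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

# The patch: create (or import) the "old location" module and attach the
# function to it, so legacy `from <old location> import bytes_to_unicode`
# keeps working after the upstream move.
old_location = types.ModuleType("legacy_tokenization_gpt2")
old_location.bytes_to_unicode = bytes_to_unicode
sys.modules["legacy_tokenization_gpt2"] = old_location
```

In the actual fix the function is imported from its new home (`transformers.convert_slow_tokenizer`) and assigned onto the old module rather than reimplemented.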
**AutoTokenizer issue:** transformers 5.x has a bug where `tokenizer_class_from_name('TikTokenTokenizer')` returns `None` for custom tokenizers with `auto_map`. Loading the tokenizer directly bypasses this.
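One way to sketch the direct-load workaround is to read the class reference straight out of `tokenizer_config.json` instead of going through the broken name lookup. The helper name below is hypothetical; the `auto_map` layout follows the standard Hugging Face convention:

```python
import json

def resolve_custom_tokenizer_class(config_text):
    """Return the 'module.Class' reference for a custom tokenizer,
    bypassing tokenizer_class_from_name()."""
    cfg = json.loads(config_text)
    # auto_map maps "AutoTokenizer" to ["module.SlowClass", fast_class_or_null]
    entry = cfg.get("auto_map", {}).get("AutoTokenizer")
    if isinstance(entry, list):
        return entry[0]
    if isinstance(entry, str):
        return entry
    # No auto_map: fall back to the plain class name.
    return cfg.get("tokenizer_class")

kimi_config = json.dumps({
    "tokenizer_class": "TikTokenTokenizer",
    "auto_map": {"AutoTokenizer": ["tokenization_kimi.TikTokenTokenizer", None]},
})
```

Once the `module.Class` reference is known, the class can be imported from the model repo and instantiated with `from_pretrained` directly, skipping `AutoTokenizer` entirely.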
**`encode()` issue:** transformers 5.x's `pad()` method fails for slow tokenizers. Using tiktoken's `encode` directly with `allowed_special="all"` avoids this path and properly handles special tokens like `<|im_user|>` from chat templates.
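A minimal sketch of the patched `encode()` path, with a toy stand-in for `tiktoken.Encoding` so it runs without tiktoken installed. The stand-in mirrors the relevant tiktoken behavior: a special token in the input raises unless it is allowed, and `allowed_special="all"` encodes every special token to its id:

```python
class FakeEncoding:
    """Stand-in for tiktoken.Encoding with a single special token."""
    special = {"<|im_user|>": 100}

    def encode(self, text, allowed_special=()):
        ids = []
        while text:
            for tok, tok_id in self.special.items():
                if text.startswith(tok):
                    if allowed_special != "all" and tok not in allowed_special:
                        # Mirrors tiktoken: disallowed special tokens raise.
                        raise ValueError(f"special token {tok!r} not allowed")
                    ids.append(tok_id)
                    text = text[len(tok):]
                    break
            else:
                ids.append(ord(text[0]))  # toy encoding: one id per character
                text = text[1:]
        return ids

def patched_encode(encoding, text):
    # allowed_special="all" lets chat-template markers like <|im_user|>
    # encode to their special-token ids instead of raising.
    return encoding.encode(text, allowed_special="all")
```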
## Test Plan

### Manual Testing
- Hardware: 2x Mac Studios connected via Thunderbolt 5 (mike22 and james21-1)
- Tested Kimi K2 Thinking model with pipeline parallelism across both nodes
- Verified warmup inference completes successfully
- Verified chat completions work with special tokens
### Automated Testing
- Added `test_tokenizers.py` with 31 tests covering:
  - Basic encode/decode for all model families (deepseek, kimi, llama, qwen, gpt-oss, glm)
  - Special token encoding (critical for chat templates)
  - Chat template application and encoding
  - Kimi-specific and GLM-specific edge cases
- All tests pass: `uv run pytest src/exo/worker/tests/unittests/test_mlx/test_tokenizers.py`
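The encode/decode round-trip style of check described above can be sketched as follows (`ToyTokenizer` is a placeholder; the real suite loads each model family's tokenizer):

```python
class ToyTokenizer:
    """Placeholder tokenizer: one id per character."""
    def encode(self, text):
        return [ord(c) for c in text]

    def decode(self, ids):
        return "".join(chr(i) for i in ids)

def check_roundtrip(tokenizer, text):
    """Encode then decode, asserting the text survives unchanged."""
    ids = tokenizer.encode(text)
    assert tokenizer.decode(ids) == text
    return ids
```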
🤖 Generated with Claude Code