Upgrade mlx-lm to 0.30.2 with transformers 5.x compatibility
## Motivation
Upgrade mlx-lm to version 0.30.2, which requires transformers 5.0.0rc2 as a prerelease dependency. This enables support for newer models such as Kimi K2 Thinking while maintaining compatibility with existing models.
The transformers 5.x release includes breaking changes that affect custom tokenizers like Kimi's TikTokenTokenizer, requiring compatibility fixes.
## Changes

### Core Changes
- mlx-lm upgrade: Bump to 0.30.2 with locked exact versions for mlx/mlx-lm to prevent breaking changes
- transformers 5.x compatibility: Enable prerelease transformers dependency
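A hypothetical dependency-pin fragment illustrating the locking described above (a sketch: only `mlx-lm==0.30.2` and `transformers==5.0.0rc2` are stated by this PR; the file layout shown is the generic `pyproject.toml` form, not necessarily this repo's):

```toml
[project]
dependencies = [
    "mlx-lm==0.30.2",          # locked exact version to prevent breaking changes
    "transformers==5.0.0rc2",  # prerelease pin enabling transformers 5.x
]
```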
### Kimi K2 Tokenizer Fixes
- Add `bytes_to_unicode` monkey-patch to restore function moved in transformers 5.0.0rc2
- Load `TikTokenTokenizer` directly instead of via `AutoTokenizer` to bypass transformers 5.x bug with `auto_map` fallback
- Patch `encode()` to use tiktoken directly with `allowed_special="all"` to handle special tokens from chat templates
### Other Changes
- Dashboard: Show disk usage for completed model downloads
- CI: Add `workflow_dispatch` trigger to build-app workflow
- Docs: Add basic API documentation
## Testing
- Add comprehensive tokenizer unit tests for all supported models
- Tests verify encode/decode, special token handling, and chat template encoding
## Why It Works
**`bytes_to_unicode` issue:** transformers 5.0.0rc2 moved `bytes_to_unicode` from `transformers.models.gpt2.tokenization_gpt2` to `transformers.convert_slow_tokenizer`. Kimi's `tokenization_kimi.py` imports it from the old location. The monkey-patch restores it there at module load time.
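The restore-by-monkey-patch pattern can be sketched with a stand-in module so the example is self-contained (the real fix targets `transformers.models.gpt2.tokenization_gpt2`; the module name `legacy_tokenization_gpt2` below is illustrative):

```python
import sys
import types

def bytes_to_unicode():
    """GPT-2-style byte-to-unicode table: maps all 256 byte values to
    printable unicode characters (a stand-in for the moved function)."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:          # non-printable bytes get fresh code points
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

# The patch: create (or import) the "old location" module and attach the
# function to it, so legacy `from <old location> import bytes_to_unicode`
# keeps working after the upstream move.
old_location = types.ModuleType("legacy_tokenization_gpt2")
old_location.bytes_to_unicode = bytes_to_unicode
sys.modules["legacy_tokenization_gpt2"] = old_location
```

In the actual fix the function is imported from its new home (`transformers.convert_slow_tokenizer`) and assigned onto the old module rather than reimplemented.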
**AutoTokenizer issue:** transformers 5.x has a bug where `tokenizer_class_from_name('TikTokenTokenizer')` returns `None` for custom tokenizers with `auto_map`. Loading the tokenizer directly bypasses this.
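One way to sketch the direct-load workaround is to read the class reference straight out of `tokenizer_config.json` instead of going through the broken name lookup. The helper name below is hypothetical; the `auto_map` layout follows the standard Hugging Face convention:

```python
import json

def resolve_custom_tokenizer_class(config_text):
    """Return the 'module.Class' reference for a custom tokenizer,
    bypassing tokenizer_class_from_name()."""
    cfg = json.loads(config_text)
    # auto_map maps "AutoTokenizer" to ["module.SlowClass", fast_class_or_null]
    entry = cfg.get("auto_map", {}).get("AutoTokenizer")
    if isinstance(entry, list):
        return entry[0]
    if isinstance(entry, str):
        return entry
    # No auto_map: fall back to the plain class name.
    return cfg.get("tokenizer_class")

kimi_config = json.dumps({
    "tokenizer_class": "TikTokenTokenizer",
    "auto_map": {"AutoTokenizer": ["tokenization_kimi.TikTokenTokenizer", None]},
})
```

Once the `module.Class` reference is known, the class can be imported from the model repo and instantiated with `from_pretrained` directly, skipping `AutoTokenizer` entirely.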
**`encode()` issue:** transformers 5.x's `pad()` method fails for slow tokenizers. Using tiktoken's `encode` directly with `allowed_special="all"` avoids this path and properly handles special tokens like `<|im_user|>` from chat templates.
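A minimal sketch of the patched `encode()` path, with a toy stand-in for `tiktoken.Encoding` so it runs without tiktoken installed. The stand-in mirrors the relevant tiktoken behavior: a special token in the input raises unless it is allowed, and `allowed_special="all"` encodes every special token to its id:

```python
class FakeEncoding:
    """Stand-in for tiktoken.Encoding with a single special token."""
    special = {"<|im_user|>": 100}

    def encode(self, text, allowed_special=()):
        ids = []
        while text:
            for tok, tok_id in self.special.items():
                if text.startswith(tok):
                    if allowed_special != "all" and tok not in allowed_special:
                        # Mirrors tiktoken: disallowed special tokens raise.
                        raise ValueError(f"special token {tok!r} not allowed")
                    ids.append(tok_id)
                    text = text[len(tok):]
                    break
            else:
                ids.append(ord(text[0]))  # toy encoding: one id per character
                text = text[1:]
        return ids

def patched_encode(encoding, text):
    # allowed_special="all" lets chat-template markers like <|im_user|>
    # encode to their special-token ids instead of raising.
    return encoding.encode(text, allowed_special="all")
```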
## Test Plan

### Manual Testing
- Hardware: 2x Mac Studios connected via Thunderbolt 5 (mike22 and james21-1)
- Tested Kimi K2 Thinking model with pipeline parallelism across both nodes
- Verified warmup inference completes successfully
- Verified chat completions work with special tokens
### Automated Testing
- Added `test_tokenizers.py` with 31 tests covering:
  - Basic encode/decode for all model families (deepseek, kimi, llama, qwen, gpt-oss, glm)
  - Special token encoding (critical for chat templates)
  - Chat template application and encoding
  - Kimi-specific and GLM-specific edge cases
- All tests pass: `uv run pytest src/exo/worker/tests/unittests/test_mlx/test_tokenizers.py`
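The encode/decode round-trip style of check described above can be sketched as follows (`ToyTokenizer` is a placeholder; the real suite loads each model family's tokenizer):

```python
class ToyTokenizer:
    """Placeholder tokenizer: one id per character."""
    def encode(self, text):
        return [ord(c) for c in text]

    def decode(self, ids):
        return "".join(chr(i) for i in ids)

def check_roundtrip(tokenizer, text):
    """Encode then decode, asserting the text survives unchanged."""
    ids = tokenizer.encode(text)
    assert tokenizer.decode(ids) == text
    return ids
```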
🤖 Generated with Claude Code