# Integrating custom MLX models

Custom MLX Models Support - Issue #918
## Motivation

Fixes #918: enables users to run custom MLX models from `mlx-community` on Hugging Face without manual code updates.
## What Changed

### 1. Frontend UI for Custom Models

Commit: "Add custom models to dashboard"

- Added a "Custom Models" section to the downloads page with a Hugging Face model ID input
- Implemented a "Download and Run" button that triggers model placement into the pipeline
- Extended the store with a `placeInstance()` method for backend API communication (see the sketch below)
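For reference, a minimal sketch of the call `placeInstance()` makes, written in Python for consistency with the rest of this description; the endpoint path and payload fields are assumptions, not the exact API added in this PR:

```python
# Minimal sketch only: the endpoint path and payload fields below are
# assumed for illustration, not the exact API added in this PR.
import requests

def place_instance(base_url: str, model_id: str) -> dict:
    """Ask the master API to download and place a custom model instance."""
    resp = requests.post(
        f"{base_url}/place_instance",  # hypothetical endpoint
        json={"model_id": model_id},   # e.g. "mlx-community/Qwen2.5-0.5B-Instruct-4bit"
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```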
### 2. Automated Tests

Commit: "Add integration test"

- Created `src/exo/worker/tests/test_custom_model.py`, an integration test (sketched below) verifying that:
  - Custom model placement is triggered correctly
  - The model downloads and loads successfully
  - Chat inference works with the loaded model
- Added `.github/workflows/test_custom_models.yml`, a GitHub Actions workflow that runs the test in CI
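The test's overall shape is roughly as follows; the fixture names (`start_worker`, `place_model`, `chat`) are hypothetical stand-ins, not the actual helpers in the test file:

```python
# Sketch of the integration test's shape; fixture names are hypothetical.
import pytest

MODEL_ID = "mlx-community/Qwen2.5-0.5B-Instruct-4bit"  # small model assumed for CI

@pytest.mark.integration
def test_custom_model_placement_and_chat(start_worker, place_model, chat):
    worker = start_worker()

    # 1. Custom model placement is triggered correctly.
    instance = place_model(worker, MODEL_ID)
    assert instance.model_id == MODEL_ID

    # 2. The model downloads and loads successfully.
    instance.wait_until_loaded(timeout=600)

    # 3. Chat inference works with the loaded model.
    reply = chat(worker, MODEL_ID, "Hello")
    assert reply.strip()
```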
### 3. Persistent Storage & Model Registration

Commit: "Persist storage for custom models"

- Fixed `resolve_model_meta()` to check both short_id keys and full model_id values (sketched below)
- Custom models are now registered in `~/.exo/custom_models.json` during download
- Registered models reload automatically from persistent storage on EXO restart
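A minimal sketch of this resolution-and-registration logic, assuming a dict-shaped model card registry and a simple JSON layout for `~/.exo/custom_models.json`; the actual internals in this PR may differ:

```python
import json
from pathlib import Path

CUSTOM_MODELS_PATH = Path.home() / ".exo" / "custom_models.json"

def resolve_model_meta(model_cards: dict, wanted: str) -> dict | None:
    """Match either a short_id key or a full model_id value."""
    if wanted in model_cards:  # short_id key, e.g. "qwen2.5-0.5b-instruct-4bit"
        return model_cards[wanted]
    for meta in model_cards.values():  # full id, e.g. "mlx-community/..."
        if meta.get("model_id") == wanted:
            return meta
    return None

def register_custom_model(model_id: str) -> None:
    """Persist a downloaded custom model so it reloads on EXO restart."""
    CUSTOM_MODELS_PATH.parent.mkdir(parents=True, exist_ok=True)
    known = (
        json.loads(CUSTOM_MODELS_PATH.read_text())
        if CUSTOM_MODELS_PATH.exists()
        else {}
    )
    known[model_id.split("/")[-1].lower()] = {"model_id": model_id}
    CUSTOM_MODELS_PATH.write_text(json.dumps(known, indent=2))
```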
## Why It Works

This implementation enables dynamic custom model loading without requiring code modifications for each new model. Users can:

- Download any `mlx-community` model via the dashboard
- Have models persist across restarts
- Test the model as soon as it loads
## Known Issues

### 1. Missing chat_template.jinja for Some Models

Some mlx-community models don't include a chat template, causing the model to output its raw instructions and control tokens instead of formatted chat responses. This is a model-specific issue with mlx-community models, not a bug in our implementation.

Workaround: use models that include a proper chat template (e.g., `mlx-community/Qwen2.5-0.5B-Instruct-4bit`) or add a `chat_template.jinja` yourself, as sketched below.
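For the second option, the template can also be supplied at load time. A minimal sketch using the `transformers` tokenizer API; the repo id is hypothetical and the template is a generic example, not any model's official format:

```python
from transformers import AutoTokenizer

# Hypothetical repo id; substitute the model that lacks a template.
tokenizer = AutoTokenizer.from_pretrained("mlx-community/some-model-4bit")

# Assign a minimal generic Jinja chat template when the repo ships none.
tokenizer.chat_template = (
    "{% for message in messages %}"
    "<|{{ message['role'] }}|>{{ message['content'] }}<|end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|assistant|>{% endif %}"
)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)  # "<|user|>Hello<|end|>\n<|assistant|>"
```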
## Testing

### Manual Testing

- Hardware: MacBook Pro (M4 Pro)
- Tested with `mlx-community/gpt-oss-20b-MXFP4-Q8`
- Verified:
  - Model appears in the downloads list with the correct size
  - Download progress bar updates in real time
  - Model persists in `~/.exo/custom_models.json`
  - Model is available after restart
  - Chat inference works correctly
### Automated Testing

- Integration test: `src/exo/worker/tests/test_custom_model.py`
- CI workflow: `.github/workflows/test_custom_models.yml`
- Input validation: only `mlx-community` models are allowed in downloads (see the sketch below)
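The validation rule amounts to something like the following sketch; the function name and error type are illustrative, not the code in this PR:

```python
def validate_custom_model_id(model_id: str) -> str:
    """Accept only Hugging Face repo ids under the mlx-community org."""
    org, _, name = model_id.partition("/")
    if org != "mlx-community" or not name:
        raise ValueError(f"only mlx-community models are allowed, got {model_id!r}")
    return model_id

validate_custom_model_id("mlx-community/Qwen2.5-0.5B-Instruct-4bit")  # ok
```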
## Files Modified

- `src/exo/master/api.py` - model resolution & API response
- `src/exo/shared/models/model_cards.py` - persistence logic
- `src/exo/worker/download/impl_shard_downloader.py` - registration on download
- `dashboard/src/routes/downloads/+page.svelte` - custom models UI
- `dashboard/src/lib/stores/app.svelte.ts` - API integration
---

Hi, I have successfully added a new feature: testing custom MLX models.

Can someone please clone & run my fork to verify downloading a larger model like `mlx-community/gpt-oss-20b-MXFP4-Q8`? I don't have enough RAM :/

I hope this is something we wanted. Currently it's only for testing purposes.

Not sure why my VSCode Prettier auto-formatted all the files I've changed.

I will probably create a new, clean PR where I only change the required code blocks, if that's needed to approve this feature request.
Looks good! I wonder if we should directly add the model to the model cards instead of a separate `KNOWN_MODELS`, but there are wider questions to be answered there.

As for Prettier, I don't believe our current formatter extends to the dashboard, so I don't particularly mind atm.
> Looks good! I wonder if we should directly add the model to the model cards instead of a separate `KNOWN_MODELS`, but there are wider questions to be answered there.

My idea was that after users test and verify a model, we add it to `model_cards` as an officially supported model. But yeah, that can be skipped.
Ok, gpt-oss-20b-MXFP4-Q8 did not work, but the download was completely fine; seems like an upstream problem.
> Ok, gpt-oss-20b-MXFP4-Q8 did not work, but the download was completely fine; seems like an upstream problem.

Yes, I see the error. This might be more difficult than I thought:

`Runner 4e13d976-5262-43eb-b513-e9678e673e59 crashed with critical exception Quantized SDPA does not support attention sinks`
This isn't an issue for this PR - we need to bump mlx versions and test afaik.
Ok, it's working: the GPT-OSS model loaded. However, I had to add TEMPORARY overrides, as in my commit 2e446ab. Not ideal; we need to wait for an official mlx version with support.
gpt-oss-20b has no `chat_template.jinja`, resulting in artifacts and instructions appearing in chat:

> QUERY: Hello
>
> EXO (09:25:43, TTFT 555 ms, 70.7 tok/s):
> `<|channel|>analysis<|message|>We need to be helpful, concise, no reasoning inside answer. Respond "Hello". Maybe ask how to help.<|end|><|start|>assistant<|channel|>final<|message|>Hello! How can I help you today?`
Appreciate the enthusiasm, but can we keep this PR down to custom models? The gpt-oss fix is a separate issue.
Hi, I have removed the specific memory overrides for the gpt-oss-20b model. Can I now request a review from a developer/maintainer for this PR: #937?

Thank you
Please continue! I'm excited to get this feature in EXO. I appreciate your patience while we work out all the details.