Run custom MLX models
Is it possible to run custom MLX models from HuggingFace, for example https://huggingface.co/mlx-community/gpt-oss-20b-MXFP4-Q8? If not, I think this is a feature many people would be interested in, particularly now that Apple has released RDMA support.
Probably the single feature we want to ship ASAP, Christmas notwithstanding. It's totally possible right now through model_cards.py; some models require more custom support for tensor sharding, but gpt-oss-20b is (I believe) not one of them.
Hi, I’ve been sort of implementing this on my own fork of exo. How can I help with this?
Testing is perhaps the main thing: add a model card to src/exo/shared/model_cards/model_cards.py and see what breaks. As a heads up, @rltakashige is also working on this, but YMMV.
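For anyone picking this up, here's a rough sketch of what an entry might look like. The class and field names below are placeholders rather than exo's actual schema, so copy an existing card from model_cards.py for the real shape:

```python
# Rough sketch only: the real schema lives in
# src/exo/shared/model_cards/model_cards.py, and the class/field names here
# are placeholders. Copy an existing entry from that file for the real layout.
from dataclasses import dataclass

@dataclass
class ModelCard:  # stand-in for whatever structure the real file uses
    repo_id: str                      # HuggingFace repo to pull weights from
    n_layers: int                     # layer count, used to shard across nodes
    default_quant: str | None = None  # quantisation variant, if any

# Hypothetical entry for the model from the original question.
GPT_OSS_20B = ModelCard(
    repo_id="mlx-community/gpt-oss-20b-MXFP4-Q8",
    n_layers=24,  # check config.json on the HF repo for the real value
    default_quant="mxfp4-q8",
)
```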
We had a version incompatibility between the gpt-oss and Kimi K2 tokenizers; it seems the release candidate of transformers isn't quite there yet.
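If anyone wants to reproduce the tokenizer issue locally, something like this should surface it (the repo ids are guesses, not necessarily the ones exo resolves):

```python
# Quick repro sketch: load both tokenizers under the same transformers install
# and see which one breaks, printing the installed version for reference.
import transformers
from transformers import AutoTokenizer

print("transformers", transformers.__version__)

for repo in (
    "mlx-community/gpt-oss-20b-MXFP4-Q8",  # from the original question
    "moonshotai/Kimi-K2-Instruct",         # assumed Kimi K2 repo id
):
    try:
        tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
        print(repo, "->", type(tok).__name__)
    except Exception as exc:
        print(repo, "failed:", exc)
```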
OK, I will try my best to test the smaller models. I don't have powerful enough hardware to test the bigger ones you mentioned. #937
GPT-OSS and GLM sharding support is around the corner, along with a few more types of Qwen models.
There is a transformers version incompatibility with Ministral3 models, which @Evanev7 pointed out.