[Feature] Support Mistral Small VLM
Checklist
- [ ] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [ ] 2. Please use English, otherwise it will be closed.
Motivation
https://mistral.ai/fr/news/mistral-small-3-1
Related resources
No response
Hello everyone, could I help as a contributor on this issue? If yes, any tips on where to start?
Is this supported yet? @zhyncs
I think it's not supported yet. You can refer to: https://docs.sglang.ai/references/supported_models.html#how-to-support-a-new-vlm
@zhaochenyang20 I would like to work on this issue!
This should come with a proposal, i.e., what's your plan to support it?
Mistral Small 3.1 looks solid for my use case: I mainly need a multimodal LM that runs on low-spec hardware.
That said, a few notes:
- It's basically a Mistral text model plus a Pixtral vision tower (quick config check below), so we'll likely need to add Pixtral support first, starting by bringing in SGLang-optimized layer ops: https://github.com/sgl-project/sglang/issues/2351
- The multimodal data layer seems to be under heavy development recently. Is there any doc or example on what guidelines to follow when handling multimodal input for a new model, or any Slack threads?
- (update) As I suspected, we may need to align the Mistral and HF checkpoint formats: https://github.com/vllm-project/vllm/issues/15212
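For a quick config-level sanity check of that composition, something like the following should work (my assumptions: a transformers release recent enough to ship the HF-format Mistral 3 config, and the usual composite-config attribute names; only config.json is downloaded):

```python
# Config-level check of the "Mistral text + Pixtral vision tower" claim;
# no weights are fetched, only config.json.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("mistralai/Mistral-Small-3.1-24B-Instruct-2503")
print(type(cfg).__name__)            # composite multimodal config class
print(cfg.text_config.model_type)    # expected: "mistral"
print(cfg.vision_config.model_type)  # expected: "pixtral"
```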
thanks!
@zhaochenyang20
Since https://github.com/sgl-project/sglang/issues/4518#issuecomment-2770738997 matches my preliminary investigation findings, our implementation roadmap should be:
- Pixtral integration: https://github.com/sgl-project/sglang/issues/2351, which can be implemented with reference to https://github.com/huggingface/transformers/pull/33449
- Mistral Small 3.1 integration, in relation to https://github.com/vllm-project/vllm/pull/15505 and the transformers release "Mistral 3 (Based on v4.49.0)"

Current status and implementation plan:
- Status quo: sglang already supports Mistral Small 3, but lacks support for the vision tower (the Pixtral model) in Mistral Small 3.1.
- Implementation strategy: first implement Pixtral support, following sglang's documentation for new VLM integration and building upon existing components like LlavaConfig (rough skeleton after this list).
- Expected outcome: full support for Pixtral, followed by complete integration with Mistral Small 3.1.
- Uncertainties: given my technical limitations, I welcome any suggestions and collaboration opportunities. Feel free to reach out if interested in contributing.
- Timeline: estimated one week for the Pixtral implementation, followed by swift integration of Mistral Small 3.1.
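To make the strategy concrete, here is a rough skeleton of what the new model file might look like. Every name below is an illustrative placeholder, not sglang's actual API: the real base classes, forward signature, and image-token handling should be taken from the current sglang source and the new-VLM guide.

```python
# Illustrative skeleton only. It mirrors the pattern of existing files under
# python/sglang/srt/models/ (a model class plus a module-level EntryClass
# export), but all internals here are placeholders to be filled in.
import torch
from torch import nn


class PixtralVisionTower(nn.Module):
    """Placeholder for an SGL-optimized port of HF's PixtralVisionModel."""

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # Port the encoder from transformers PR #33449, swapping in sglang's
        # optimized attention/layer ops.
        raise NotImplementedError


class MistralSmall31ForConditionalGeneration(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.vision_tower = PixtralVisionTower()
        # self.multi_modal_projector = ...  # vision features -> text hidden size
        # self.language_model = ...         # reuse sglang's existing Mistral support

    def forward(self, input_ids, positions, forward_batch):
        # Same flow as the LLaVA-style models: encode images, splice the image
        # embeddings in at the image-token positions, run the Mistral decoder.
        raise NotImplementedError


# sglang discovers model implementations through this module-level export.
EntryClass = [MistralSmall31ForConditionalGeneration]
```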
great! go for it
Found an unusual implementation while testing vLLM for Pixtral: starting from vLLM's Pixtral support PR (vllm-project/vllm#8377), Mistral-AI configs need separate handling, as their newly published models no longer comply with HF standards.
SGLang appears HF-dependent for now. Do we have a plan for supporting Mistral-AI-format models? @zhaochenyang20
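For anyone triaging repos, a small heuristic I've been using (my assumption about the two layouts: Mistral-native repos ship params.json plus consolidated weights, while HF-format repos ship config.json):

```python
# Heuristic repo-format check: HF-format repos carry config.json, while
# Mistral-native repos carry params.json; repos shipping both report "hf".
from huggingface_hub import list_repo_files

def repo_format(repo_id: str) -> str:
    files = set(list_repo_files(repo_id))
    if "config.json" in files:
        return "hf"
    if "params.json" in files:
        return "mistral-native"
    return "unknown"

print(repo_format("mistral-community/pixtral-12b"))  # expected: "hf"
```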
Based on the description and discussion thread on Hugging Face for [mistralai/Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503), it appears that the current version of the model is indeed in HF format.
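One way to verify that without pulling any weights (the expected output is my reading of the repo, so treat it as an assumption):

```python
# Fetch only config.json and inspect the declared transformers architecture;
# a Mistral-native-only repo would not have a config.json at all.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download("mistralai/Mistral-Small-3.1-24B-Instruct-2503", "config.json")
with open(path) as f:
    print(json.load(f).get("architectures"))
```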
Thanks for pointing it out, @yhyang201! I just updated my model registry to the latest commit. As you mentioned, a batch update two weeks ago merged the two formats for many Mistral-AI repos.
However, the official pixtral-12b is still in the legacy format.
So I suggest bypassing it for now and instead reusing the LLaVA architecture to support the community version; likely the same approach applies to Mistral Small 3.1.
With that said, it doesn't seem trivial.
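For reference, what makes the reuse plausible is that the community repo already loads through transformers' LLaVA classes; a minimal check (this pulls the full 12B weights, so it needs substantial memory):

```python
# The HF-format community repo is LLaVA-shaped: it loads via transformers'
# LlavaForConditionalGeneration, which is the reference implementation to port.
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "mistral-community/pixtral-12b"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)
print(model.config.vision_config.model_type)  # expected: "pixtral"
```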
In the meantime, I'm tweaking an SGLang branch for my own use case. It has minimal runnable pixtral-12b support through the LLaVA path, tested with tp_size {1, 2}; however, I'm still troubleshooting some inconsistencies in the generated output (repro harness below).
If that aligns with our ongoing efforts, I can draft a PR after clean-ups.
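In case it helps reproduce the inconsistency: I launch two servers from the branch with `python -m sglang.launch_server --model-path <branch model> --tp 1` and `--tp 2` on different ports, then compare greedy outputs over the OpenAI-compatible endpoint. The ports and request shape below are assumptions about my local setup, not part of the branch:

```python
# Compare greedy outputs from two local sglang servers (tp=1 vs tp=2) via the
# OpenAI-compatible API; with temperature 0 the outputs should match exactly.
import requests

PROMPT = "Describe this image-token pipeline in one sentence."

def generate(port: int, prompt: str) -> str:
    resp = requests.post(
        f"http://localhost:{port}/v1/chat/completions",
        json={
            "model": "default",  # sglang serves the model it was launched with
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.0,
            "max_tokens": 64,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

out_tp1 = generate(30000, PROMPT)  # server launched with --tp 1 --port 30000
out_tp2 = generate(30001, PROMPT)  # server launched with --tp 2 --port 30001
print("outputs match:", out_tp1 == out_tp2)
```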
Is this supported now?
@Stealthwriter We are rebasing and adding CI for the VLMs. Will merge it soon.
Completed by https://github.com/sgl-project/sglang/pull/5099