[Feature] Support Mistral Small VLM
Checklist
- [ ] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [ ] 2. Please use English, otherwise it will be closed.
Motivation
https://mistral.ai/fr/news/mistral-small-3-1
Related resources
No response
Hello everyone, could I help as a contributor on this issue? If yes, any tips on where to start?
Is this supported yet? @zhyncs
I think it's not supported yet. You can refer to: https://docs.sglang.ai/references/supported_models.html#how-to-support-a-new-vlm
@zhaochenyang20 I would like to work on this issue!
This should come with a proposal, i.e., what's your plan to support it?
Mistral Small 3.1 looks solid for my use case: I mainly need a multimodal LM that runs on low-spec hardware.
That said, a few notes:
- It's basically a Mistral text model plus a Pixtral vision tower (quick config check below), so we'll likely need to add Pixtral support first, starting by bringing in SGLang-optimized layer ops: https://github.com/sgl-project/sglang/issues/2351
- The multimodal data layer seems to be under heavy development recently. Is there any doc or example on what guidelines to follow when handling multimodal input for a new model, or any Slack threads?
- (update) As I suspected, we may need to align the Mistral and HF checkpoint formats: https://github.com/vllm-project/vllm/issues/15212
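For a quick config-level sanity check of that composition, something like the following should work (my assumptions: a transformers release recent enough to ship the HF-format Mistral 3 config, and the usual composite-config attribute names; only config.json is downloaded):

```python
# Config-level check of the "Mistral text + Pixtral vision tower" claim;
# no weights are fetched, only config.json.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("mistralai/Mistral-Small-3.1-24B-Instruct-2503")
print(type(cfg).__name__)            # composite multimodal config class
print(cfg.text_config.model_type)    # expected: "mistral"
print(cfg.vision_config.model_type)  # expected: "pixtral"
```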
thanks!
@zhaochenyang20
Since https://github.com/sgl-project/sglang/issues/4518#issuecomment-2770738997 matches my preliminary investigation findings, our implementation roadmap should be:
- Pixtral integration: https://github.com/sgl-project/sglang/issues/2351, which can be implemented with reference to https://github.com/huggingface/transformers/pull/33449
- Mistral Small 3.1 integration, in relation to https://github.com/vllm-project/vllm/pull/15505 and the transformers release "Mistral 3 (Based on v4.49.0)"

Current status and implementation plan:
- Status quo: sglang already supports Mistral Small 3, but lacks support for the vision tower (the Pixtral model) in Mistral Small 3.1.
- Implementation strategy: first implement Pixtral support, following sglang's documentation for new VLM integration and building upon existing components like LlavaConfig (rough skeleton after this list).
- Expected outcome: full support for Pixtral, followed by complete integration with Mistral Small 3.1.
- Uncertainties: given my technical limitations, I welcome any suggestions and collaboration opportunities. Feel free to reach out if interested in contributing.
- Timeline: estimated one week for the Pixtral implementation, followed by swift integration of Mistral Small 3.1.
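To make the strategy concrete, here is a rough skeleton of what the new model file might look like. Every name below is an illustrative placeholder, not sglang's actual API: the real base classes, forward signature, and image-token handling should be taken from the current sglang source and the new-VLM guide.

```python
# Illustrative skeleton only. It mirrors the pattern of existing files under
# python/sglang/srt/models/ (a model class plus a module-level EntryClass
# export), but all internals here are placeholders to be filled in.
import torch
from torch import nn


class PixtralVisionTower(nn.Module):
    """Placeholder for an SGL-optimized port of HF's PixtralVisionModel."""

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # Port the encoder from transformers PR #33449, swapping in sglang's
        # optimized attention/layer ops.
        raise NotImplementedError


class MistralSmall31ForConditionalGeneration(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.vision_tower = PixtralVisionTower()
        # self.multi_modal_projector = ...  # vision features -> text hidden size
        # self.language_model = ...         # reuse sglang's existing Mistral support

    def forward(self, input_ids, positions, forward_batch):
        # Same flow as the LLaVA-style models: encode images, splice the image
        # embeddings in at the image-token positions, run the Mistral decoder.
        raise NotImplementedError


# sglang discovers model implementations through this module-level export.
EntryClass = [MistralSmall31ForConditionalGeneration]
```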
great! go for it
Found an unusual implementation while testing vLLM for Pixtral: starting from vLLM's Pixtral support PR (vllm-project/vllm#8377), Mistral-AI configs need separate handling, as their newly published models no longer comply with HF standards.
SGLang appears HF-dependent for now. Do we have a plan for supporting Mistral-AI-format models? @zhaochenyang20
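For anyone triaging repos, a small heuristic I've been using (my assumption about the two layouts: Mistral-native repos ship params.json plus consolidated weights, while HF-format repos ship config.json):

```python
# Heuristic repo-format check: HF-format repos carry config.json, while
# Mistral-native repos carry params.json; repos shipping both report "hf".
from huggingface_hub import list_repo_files

def repo_format(repo_id: str) -> str:
    files = set(list_repo_files(repo_id))
    if "config.json" in files:
        return "hf"
    if "params.json" in files:
        return "mistral-native"
    return "unknown"

print(repo_format("mistral-community/pixtral-12b"))  # expected: "hf"
```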
Based on the description and discussion thread on Hugging Face for [mistralai/Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503), it appears that the current version of the model is indeed in HF format.
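One way to verify that without pulling any weights (the expected output is my reading of the repo, so treat it as an assumption):

```python
# Fetch only config.json and inspect the declared transformers architecture;
# a Mistral-native-only repo would not have a config.json at all.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download("mistralai/Mistral-Small-3.1-24B-Instruct-2503", "config.json")
with open(path) as f:
    print(json.load(f).get("architectures"))
```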
Thanks for pointing it out, @yhyang201! I just updated my model registry to the latest commit. As you mentioned, a batch update two weeks ago merged the two formats for many Mistral-AI repos.
However, the official pixtral-12b is still in the legacy format.
So I suggest bypassing it for now and instead reusing the LLaVA architecture to support the community version; likely the same approach applies to Mistral Small 3.1.
With that said, it doesn't seem trivial.
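For reference, what makes the reuse plausible is that the community repo already loads through transformers' LLaVA classes; a minimal check (this pulls the full 12B weights, so it needs substantial memory):

```python
# The HF-format community repo is LLaVA-shaped: it loads via transformers'
# LlavaForConditionalGeneration, which is the reference implementation to port.
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "mistral-community/pixtral-12b"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)
print(model.config.vision_config.model_type)  # expected: "pixtral"
```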
In the meantime, I'm tweaking an SGLang branch for my own use case. It has minimal runnable pixtral-12b support through the LLaVA path, tested with tp_size {1, 2}; however, I'm still troubleshooting some inconsistencies in the generated output (repro harness below).
If that aligns with our ongoing efforts, I can draft a PR after clean-ups.
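In case it helps reproduce the inconsistency: I launch two servers from the branch with `python -m sglang.launch_server --model-path <branch model> --tp 1` and `--tp 2` on different ports, then compare greedy outputs over the OpenAI-compatible endpoint. The ports and request shape below are assumptions about my local setup, not part of the branch:

```python
# Compare greedy outputs from two local sglang servers (tp=1 vs tp=2) via the
# OpenAI-compatible API; with temperature 0 the outputs should match exactly.
import requests

PROMPT = "Describe this image-token pipeline in one sentence."

def generate(port: int, prompt: str) -> str:
    resp = requests.post(
        f"http://localhost:{port}/v1/chat/completions",
        json={
            "model": "default",  # sglang serves the model it was launched with
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.0,
            "max_tokens": 64,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

out_tp1 = generate(30000, PROMPT)  # server launched with --tp 1 --port 30000
out_tp2 = generate(30001, PROMPT)  # server launched with --tp 2 --port 30001
print("outputs match:", out_tp1 == out_tp2)
```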
Is this supported now?
@Stealthwriter We are rebasing and adding CI for the VLMs. Will merge it soon.
Completed by https://github.com/sgl-project/sglang/pull/5099