
[Feature] support mistral small vlm

Open zhyncs opened this issue 9 months ago • 15 comments

Checklist

  • [ ] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • [ ] 2. Please use English, otherwise it will be closed.

Motivation

https://mistral.ai/fr/news/mistral-small-3-1

Related resources

No response

zhyncs avatar Mar 17 '25 18:03 zhyncs

Hello everyone, could I help as a contributor on this issue? If yes, any tips on where to start?

draqos avatar Mar 25 '25 16:03 draqos

Is this supported yet? @zhyncs

rjmehta1993 avatar Mar 28 '25 18:03 rjmehta1993

I think it's not supported yet; you can refer to https://docs.sglang.ai/references/supported_models.html#how-to-support-a-new-vlm
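
Roughly, per that doc, a new model file under python/sglang/srt/models/ looks like the skeleton below. This is a hypothetical sketch only; the class name and internals are placeholders, and the real signatures should be copied from an existing model file such as llava.py:

```python
# Hypothetical skeleton for a new VLM under python/sglang/srt/models/ --
# a sketch, not the actual Pixtral implementation.
from torch import nn


class PixtralForConditionalGeneration(nn.Module):
    def __init__(self, config, quant_config=None):
        super().__init__()
        self.config = config
        # vision tower + multimodal projector + Mistral text backbone go here

    def forward(self, input_ids, positions, forward_batch):
        # 1) encode images, 2) splice image embeddings into the token stream,
        # 3) run the language model over the merged embeddings
        raise NotImplementedError

    def load_weights(self, weights):
        # map HF checkpoint tensor names onto the SGLang modules
        raise NotImplementedError


# SGLang discovers model implementations via a module-level EntryClass
EntryClass = PixtralForConditionalGeneration
```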

FlamingoPg avatar Mar 30 '25 16:03 FlamingoPg

@zhaochenyang20 I would like to work on this issue!

Nagi-ovo avatar Apr 01 '25 16:04 Nagi-ovo

@zhaochenyang20 I would like to work on this issue!

This should be a proposal, like what's on your plan to support it.

zhaochenyang20 avatar Apr 01 '25 17:04 zhaochenyang20

Mistral Small 3.1 looks solid for my use case: I mainly need a multimodal LM that runs on low-spec hardware.

That said, a few notes:

  • it's basically a Mistral text model plus a Pixtral vision tower. We'll likely need to add Pixtral support, starting by bringing in SGL-optimized layer ops and the like (see the sketch after this list). https://github.com/sgl-project/sglang/issues/2351
  • the multimodal data layer seems to be under heavy development recently; is there any doc or example on what guidelines to follow when handling multimodal input for a new model? Or any Slack threads?
  • (update) as I suspected, we may need to reconcile the Mistral-native and HF formats. https://github.com/vllm-project/vllm/issues/15212
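
A minimal sketch of that wiring, for the first bullet; the module names are illustrative, not SGLang internals, and the three submodules are assumed to exist elsewhere:

```python
# Sketch of the text-model-plus-vision-tower composition described above.
# Only the wiring between the assumed submodules is shown.
import torch
from torch import nn


class MistralSmall31VLM(nn.Module):
    def __init__(self, vision_tower: nn.Module, projector: nn.Module,
                 language_model: nn.Module):
        super().__init__()
        self.vision_tower = vision_tower      # Pixtral ViT encoder
        self.projector = projector            # vision hidden dim -> text hidden dim
        self.language_model = language_model  # Mistral decoder

    def forward(self, input_embeds: torch.Tensor, pixel_values: torch.Tensor,
                image_token_mask: torch.Tensor) -> torch.Tensor:
        # encode and project the image patches
        image_feats = self.projector(self.vision_tower(pixel_values))
        # overwrite the <image> placeholder embeddings with image features
        input_embeds = input_embeds.clone()
        input_embeds[image_token_mask] = image_feats.to(input_embeds.dtype)
        return self.language_model(inputs_embeds=input_embeds)
```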

KivenChen avatar Apr 01 '25 21:04 KivenChen

Mistral Small 3.1 looks solid for my use case: I mainly need a multimodal LM that runs on low-spec hardware.

That said, a few notes:

  • it's basically a Mistral text model plus a Pixtral vision tower. We'll likely need to add Pixtral support, starting by bringing in SGL-optimized layer ops and the like.
  • the multimodal data layer seems to be under heavy development recently; is there any doc or example on what guidelines to follow when handling multimodal input for a new model? Or any Slack threads?
  • (update) as I suspected, we may need to reconcile the Mistral-native and HF formats. [Feature]: Mistral Small 3.1 HF support vllm-project/vllm#15212

thanks!

zhaochenyang20 avatar Apr 01 '25 21:04 zhaochenyang20

https://github.com/sgl-project/sglang/issues/4518#issuecomment-2770738997 matches my preliminary investigation findings. Based on this, our implementation roadmap should be:

  1. Pixtral Integration: https://github.com/sgl-project/sglang/issues/2351, which can be implemented with reference to https://github.com/huggingface/transformers/pull/33449

  2. Mistral Small 3.1 Integration, in relation to https://github.com/vllm-project/vllm/pull/15505 & Release Mistral 3 (Based on v4.49.0)

Current Status and Implementation Plan:

  1. Status Quo: sglang already supports Mistral Small 3, but lacks support for the vision tower (Pixtral model) in Mistral Small 3.1.

  2. Implementation Strategy: First implement Pixtral support, following sglang's documentation for new VLM integration and building upon existing components like LlavaConfig (see the config sketch below).

  3. Expected Outcome: Full support for Pixtral, followed by complete integration with Mistral Small 3.1.

  4. Uncertainties: Given my technical limitations, I welcome any suggestions and collaboration opportunities. Feel free to reach out if interested in contributing.

  5. Timeline: Estimated 1 week for Pixtral implementation, followed by swift integration of Mistral Small 3.1.
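
As a quick sanity check on the decomposition in step 2, the composite HF config can be inspected directly (assuming a transformers version that ships the Mistral 3 release referenced above):

```python
# Inspect the composite config of the HF-format checkpoint: it should expose
# a Pixtral-style vision_config alongside the Mistral text_config, mirroring
# the Llava-style composition.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("mistralai/Mistral-Small-3.1-24B-Instruct-2503")
print(type(cfg).__name__)  # expected: a Mistral3-style composite config
print(cfg.vision_config)   # vision tower (Pixtral) hyperparameters
print(cfg.text_config)     # language model (Mistral) hyperparameters
```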

@zhaochenyang20

Nagi-ovo avatar Apr 02 '25 04:04 Nagi-ovo

#4518 (comment) matches my preliminary investigation findings. Based on this, our implementation roadmap should be:

  1. Pixtral Integration: [Feature] Support mistralai/Pixtral #2351, which can be implemented with reference to Add support for Pixtral huggingface/transformers#33449
  2. Mistral Small 3.1 Integration, in relation to [Model] Support Mistral3 in the HF Transformers format vllm-project/vllm#15505 & Release Mistral 3 (Based on v4.49.0)

Current Status and Implementation Plan:

  1. Status Quo: sglang already supports Mistral Small 3, but lacks support for the vision tower (Pixtral model) in Mistral Small 3.1.
  2. Implementation Strategy: First implement Pixtral support, following sglang's documentation for new VLM integration, building upon existing components like LlavaConfig.
  3. Expected Outcome: Full support for Pixtral, followed by complete integration with Mistral Small 3.1.
  4. Uncertainties: Given my technical limitations, I welcome any suggestions and collaboration opportunities. Feel free to reach out if interested in contributing.
  5. Timeline: Estimated 1 week for Pixtral implementation, followed by swift integration of Mistral Small 3.1.

@zhaochenyang20

great! go for it

zhaochenyang20 avatar Apr 02 '25 06:04 zhaochenyang20

Found an unusual implementation while testing vLLM for Pixtral: starting with the Pixtral support PR (vllm-project/vllm#8377), Mistral AI configs need separate handling, as their newly published models no longer comply with HF standards.

SGLang appears HF-dependent for now. Do we have a future plan for supporting Mistral AI's native model format? @zhaochenyang20
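
For context, the two layouts are easy to tell apart: HF-format repos ship a config.json, while Mistral-native repos ship a params.json (the consolidated format vLLM handles separately). A hypothetical probe (not existing SGLang code) could branch on that:

```python
# Hypothetical format probe: branch on which config file the repo ships.
from huggingface_hub import list_repo_files


def checkpoint_format(repo_id: str) -> str:
    files = set(list_repo_files(repo_id))
    if "config.json" in files:
        return "hf"       # standard transformers layout
    if "params.json" in files:
        return "mistral"  # Mistral-native layout, needs separate handling
    raise ValueError(f"unrecognized checkpoint layout for {repo_id}")


# expected: "mistral" for the official repo at the time of writing
print(checkpoint_format("mistralai/Pixtral-12B-2409"))
```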

KivenChen avatar Apr 03 '25 01:04 KivenChen

Found an unusual implementation while testing vLLM for Pixtral: starting with the Pixtral support PR (vllm-project/vllm#8377), Mistral AI configs need separate handling, as their newly published models no longer comply with HF standards.

SGLang appears HF-dependent for now. Do we have a future plan for supporting Mistral AI's native model format? @zhaochenyang20

Based on the description and discussion thread on Hugging Face for [mistralai/Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503), it appears that the current version of the model is indeed in HF format.

yhyang201 avatar Apr 05 '25 02:04 yhyang201

Found an unusual implementation while testing vLLM for Pixtral: starting with the Pixtral support PR (vllm-project/vllm#8377), Mistral AI configs need separate handling, as their newly published models no longer comply with HF standards. SGLang appears HF-dependent for now. Do we have a future plan for supporting Mistral AI's native model format? @zhaochenyang20

Based on the description and discussion thread on Hugging Face for [mistralai/Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503), it appears that the current version of the model is indeed in HF format.

Thanks for pointing it out, @yhyang201! I just updated my model registry to the latest commit. As you mentioned, that batch update two weeks ago merged the two formats for many Mistral AI repos.

However, the official pixtral-12b is still on the legacy list.

Therefore I suggest bypassing it for now and instead reusing the LLaVA architecture to support the community version; likely the same approach applies to Mistral Small 3.1.

KivenChen avatar Apr 05 '25 06:04 KivenChen

With that said, it doesn't seem trivial.

In the meantime, I'm tweaking an SGL branch for my own use case. It has minimal runnable pixtral-12b support through LLaVA, tested with tp_size {1, 2}. However, I'm still troubleshooting some inconsistencies in the generated output.

If that aligns with our ongoing efforts, I can draft a PR after clean-ups.
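
The smoke test looks roughly like this (a sketch: it assumes SGLang's offline Engine API and the community HF-format checkpoint; the exact repo id and arguments may differ on my branch):

```python
# Minimal smoke test for the experimental branch: load the community
# HF-format Pixtral checkpoint with tensor parallelism and run one prompt.
import sglang as sgl

llm = sgl.Engine(model_path="mistral-community/pixtral-12b", tp_size=2)
out = llm.generate(
    "Describe the image in one sentence.",
    {"temperature": 0.0, "max_new_tokens": 64},
)
print(out)
```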

KivenChen avatar Apr 05 '25 07:04 KivenChen

Is it supported now?

Stealthwriter avatar May 12 '25 09:05 Stealthwriter

@Stealthwriter we are rebasing and adding CI to the VLMs. We will merge it soon.

zhaochenyang20 avatar May 12 '25 17:05 zhaochenyang20

Completed by https://github.com/sgl-project/sglang/pull/5099

b8zhong avatar May 21 '25 15:05 b8zhong