[New Model][Format]: Support the HF-version of Pixtral
The model to consider.
vLLM supports mistral's "consolidated" format for the Pixtral model found at: https://huggingface.co/mistral-community/pixtral-12b-240910
However, when HF implemented Pixtral in Transformers, they used a different format that leverages the existing Llava model structure. Model example: https://huggingface.co/mistral-community/pixtral-12b
HF PR reference: https://github.com/huggingface/transformers/pull/33449
Supporting the HF version means we can produce quantized versions of the model with LLM Compressor.
The closest model vllm already supports.
No response
What's your difficulty of supporting the model you want?
Easy to moderate; all operations should already be implemented inside of vLLM.
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Related issues:
- https://github.com/vllm-project/vllm/issues/8566
Do you have any suggested workarounds at this time?
@Reichenbachian you can use the official Mistral consolidated checkpoint with vLLM if you want to use Pixtral.
As for supporting the HF format, we are still waiting on someone to contribute the implementation.
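For reference, running the consolidated checkpoint offline looks roughly like this. This is a minimal sketch based on the offline chat API; the sampling settings and image URL are placeholders, and the exact multimodal message format may differ between vLLM versions:

```python
from vllm import LLM
from vllm.sampling_params import SamplingParams

# Official consolidated-format checkpoint; tokenizer_mode="mistral" makes vLLM
# use mistral_common tokenization instead of an HF tokenizer.
llm = LLM(model="mistralai/Pixtral-12B-2409", tokenizer_mode="mistral")

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image in one sentence."},
        # Placeholder image URL
        {"type": "image_url", "image_url": {"url": "https://picsum.photos/id/237/200/300"}},
    ],
}]

outputs = llm.chat(messages, sampling_params=SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```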
Hey @mgoin, we have a fine-tuned version that has gone through the transformers library, so using the consolidated checkpoint isn't going to work for us, unfortunately. If you can point me in the right direction, though, I might be able to implement it.
Otherwise, we may just retrain with Llama 3.2 Vision. Sorry for the out-of-scope question, but are you aware of any similar issues there?
Thanks @Reichenbachian, we definitely want to have this implementation, for similar reasons. The key part that needs to be implemented is registering the Pixtral vision tower in the `_init_vision_tower` function in `llava.py` (rough sketch below).
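Something along these lines; this is only a simplified sketch, not the actual vLLM code. The real `_init_vision_tower` also threads through quantization config and layer-count overrides, and `PixtralHFVisionModel` is a hypothetical name for the encoder module that would still need to be written:

```python
# Simplified sketch of extending _init_vision_tower in
# vllm/model_executor/models/llava.py to dispatch on Pixtral's vision config.
from transformers import CLIPVisionConfig, SiglipVisionConfig
from transformers import PixtralVisionConfig  # added by HF PR #33449

from vllm.model_executor.models.clip import CLIPVisionModel
from vllm.model_executor.models.siglip import SiglipVisionModel


def _init_vision_tower(hf_config):
    """Pick the vision encoder based on the type of hf_config.vision_config."""
    vision_config = hf_config.vision_config

    if isinstance(vision_config, CLIPVisionConfig):
        return CLIPVisionModel(vision_config)  # existing path
    if isinstance(vision_config, SiglipVisionConfig):
        return SiglipVisionModel(vision_config)  # existing path
    if isinstance(vision_config, PixtralVisionConfig):
        # New branch: swap in Pixtral's vision encoder and keep the rest of
        # the Llava-style pipeline (projector + language model) unchanged.
        return PixtralHFVisionModel(vision_config)  # hypothetical module, to be implemented
    raise NotImplementedError(f"Unsupported vision config: {type(vision_config)}")
```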
Llama 3.2 Vision is supported and should be fine for that use case, but it is less optimized, since its cross-attention architecture is much less common than the Llava-style architecture most other VLMs have been using.
I have started a draft where we can load the weights properly, but it still needs more work to perform inference correctly: https://github.com/vllm-project/vllm/pull/9036
@Reichenbachian if you simply want to run your own fine-tuned version, a user wrote a conversion script from HF format --> Mistral format: https://github.com/spring-anth/transform_pixtral/blob/main/convert_hf_transformers_pixtral_model_to_vllm_compatible_version.py You could just add an extra pass to convert your model to the supported format.
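If you go that route, a possible follow-up looks like the untested sketch below: the local path is a placeholder for the conversion output, and `load_format="mistral"` / `config_format="mistral"` assume a vLLM version that exposes those options.

```python
from vllm import LLM

# Untested sketch: load the converted (consolidated/Mistral-format) checkpoint
# from a local directory instead of the Hugging Face Hub.
llm = LLM(
    model="/path/to/converted-pixtral",  # placeholder: output dir of the conversion script
    tokenizer_mode="mistral",            # use mistral_common tokenization
    load_format="mistral",               # assumes this load-format option is available
    config_format="mistral",             # assumes this config-format option is available
)
```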