[New Model][Format]: Support the HF-version of Pixtral
The model to consider.
vLLM supports mistral's "consolidated" format for the Pixtral model found at: https://huggingface.co/mistral-community/pixtral-12b-240910
However, when HF implemented Pixtral in Transformers, they used a different format that leverages the existing Llava model structure. Model example: https://huggingface.co/mistral-community/pixtral-12b
HF PR reference: https://github.com/huggingface/transformers/pull/33449
Supporting the HF version means we can produce quantized versions of the model with LLM Compressor.
The closest model vllm already supports.
No response
What's your difficulty of supporting the model you want?
Easy to moderate; all operations should already be implemented inside of vLLM.
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Related issues:
- https://github.com/vllm-project/vllm/issues/8566
Do you have any suggested workarounds at this time?
@Reichenbachian you can use the official Mistral consolidated checkpoint with vLLM if you want to use Pixtral.
As for supporting the HF format, we are still waiting on someone to contribute the implementation.
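For reference, running the consolidated checkpoint offline looks roughly like this. This is a minimal sketch based on the offline chat API; the sampling settings and image URL are placeholders, and the exact multimodal message format may differ between vLLM versions:

```python
from vllm import LLM
from vllm.sampling_params import SamplingParams

# Official consolidated-format checkpoint; tokenizer_mode="mistral" makes vLLM
# use mistral_common tokenization instead of an HF tokenizer.
llm = LLM(model="mistralai/Pixtral-12B-2409", tokenizer_mode="mistral")

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image in one sentence."},
        # Placeholder image URL
        {"type": "image_url", "image_url": {"url": "https://picsum.photos/id/237/200/300"}},
    ],
}]

outputs = llm.chat(messages, sampling_params=SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```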
Hey @mgoin, we have a fine-tuned version that has gone through the transformers library, so using the consolidated checkpoint isn't going to work for us, unfortunately. If you can point me in the right direction, though, I might be able to implement it.
Otherwise, we may just retrain with Llama 3.2 Vision. Sorry for the out-of-scope question, but are you aware of any similar issues there?
Thanks @Reichenbachian, we definitely want to have this implementation, for similar reasons. The key part that needs to be implemented is registering the Pixtral vision tower in the `_init_vision_tower` function in `llava.py` (rough sketch below).
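Something along these lines; this is only a simplified sketch, not the actual vLLM code. The real `_init_vision_tower` also threads through quantization config and layer-count overrides, and `PixtralHFVisionModel` is a hypothetical name for the encoder module that would still need to be written:

```python
# Simplified sketch of extending _init_vision_tower in
# vllm/model_executor/models/llava.py to dispatch on Pixtral's vision config.
from transformers import CLIPVisionConfig, SiglipVisionConfig
from transformers import PixtralVisionConfig  # added by HF PR #33449

from vllm.model_executor.models.clip import CLIPVisionModel
from vllm.model_executor.models.siglip import SiglipVisionModel


def _init_vision_tower(hf_config):
    """Pick the vision encoder based on the type of hf_config.vision_config."""
    vision_config = hf_config.vision_config

    if isinstance(vision_config, CLIPVisionConfig):
        return CLIPVisionModel(vision_config)  # existing path
    if isinstance(vision_config, SiglipVisionConfig):
        return SiglipVisionModel(vision_config)  # existing path
    if isinstance(vision_config, PixtralVisionConfig):
        # New branch: swap in Pixtral's vision encoder and keep the rest of
        # the Llava-style pipeline (projector + language model) unchanged.
        return PixtralHFVisionModel(vision_config)  # hypothetical module, to be implemented
    raise NotImplementedError(f"Unsupported vision config: {type(vision_config)}")
```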
Llama 3.2 Vision is supported and should be fine for that use case, but it is less optimized, since its cross-attention architecture is much less common than the Llava-style architecture most other VLMs have been using.
I have started a draft where we can load the weights properly, but it still needs more work to perform inference correctly: https://github.com/vllm-project/vllm/pull/9036
@Reichenbachian if you simply want to run your own fine-tuned version, a user wrote a conversion script from HF format --> Mistral format: https://github.com/spring-anth/transform_pixtral/blob/main/convert_hf_transformers_pixtral_model_to_vllm_compatible_version.py You could just add an extra pass to convert your model to the supported format.
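If you go that route, a possible follow-up looks like the untested sketch below: the local path is a placeholder for the conversion output, and `load_format="mistral"` / `config_format="mistral"` assume a vLLM version that exposes those options.

```python
from vllm import LLM

# Untested sketch: load the converted (consolidated/Mistral-format) checkpoint
# from a local directory instead of the Hugging Face Hub.
llm = LLM(
    model="/path/to/converted-pixtral",  # placeholder: output dir of the conversion script
    tokenizer_mode="mistral",            # use mistral_common tokenization
    load_format="mistral",               # assumes this load-format option is available
    config_format="mistral",             # assumes this config-format option is available
)
```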