[Feature] Add support for `Pixtral` vision-language model
Feature request
https://mistral.ai/news/pixtral-12b/
Motivation / references
- https://mistral.ai/news/pixtral-12b/ - a capable vision-language model from Mistral AI (released Sep 17, 2024)
- https://huggingface.co/mistralai/Pixtral-12B-Base-2409
- https://huggingface.co/mistralai/Pixtral-12B-2409
- https://huggingface.co/mistralai/Pixtral-Large-Instruct-2411 (124B)
Your contribution
If somebody volunteers to start this work, I can answer questions and help with testing.
Towards OPE-1030
This issue sounds interesting. Could you provide me with more details? Thank you!
Hi @scopophobic, sorry for the late response! This involves adding configs to Oumi to be able to train, infer, and eval the target model. See our guide here: https://docs.google.com/document/d/1ZzDt3nd4sLEfYEooJsAPviecbgvshsW540aR3oRTzqQ/edit?usp=sharing
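As a starting point, a rough smoke test of the model outside Oumi could look like the sketch below. It assumes a recent `transformers` release where Pixtral loads through the Llava classes and the transformers-format community checkpoint `mistral-community/pixtral-12b` (the `mistralai/*` repos ship weights in Mistral's own format); the image URL and prompt are just placeholders:

```python
# Rough smoke test: load Pixtral and run a single image+text generation.
# Assumptions: a recent `transformers` release where Pixtral is served through
# the Llava classes, and the transformers-format checkpoint
# "mistral-community/pixtral-12b". Adjust the model id, image URL, and prompt
# as needed.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "mistral-community/pixtral-12b"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build a chat-style prompt with one image; the processor's chat template
# expands the image placeholder into the tokens expected by the vision tower.
image_url = "https://picsum.photos/seed/pixtral/512/384"  # placeholder image
image = Image.open(requests.get(image_url, stream=True).raw)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```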
Hey @wizeng23, just wanted to share a quick update. I've set up the initial LoRA training config for Pixtral-12B and got the training loop running end-to-end on a small dummy text-only dataset. Everything loads and runs without crashing so far, so that's a win 😅

Next, I'm trying to better understand Pixtral's vision capabilities, especially how to handle image-text inputs during training. Are there any datasets you'd recommend for fine-tuning or experimenting with VLMs like Pixtral? Even a public benchmark you've tested would be super helpful.

Also, I've done all this locally, but I'll need to move training to a cloud setup (probably GCP), since my Mac doesn't actually have the hardware to train at scale. For now I'm just focusing on getting the setup solid and understanding the data path.
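For reference, the adapter setup I'm experimenting with is roughly the sketch below. It uses PEFT directly rather than the eventual Oumi config, assumes the transformers-format checkpoint `mistral-community/pixtral-12b`, and the target modules and hyperparameters are placeholders rather than tuned values:

```python
# Illustrative only: attach LoRA adapters to Pixtral's attention projections
# via PEFT. This is a stand-in for what the eventual Oumi LoRA config should
# express; rank, alpha, and dropout are guesses, not tuned values.
import torch
from peft import LoraConfig, get_peft_model
from transformers import LlavaForConditionalGeneration

model_id = "mistral-community/pixtral-12b"  # transformers-format checkpoint (assumption)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # PEFT matches these names anywhere in the model, so this also wraps the
    # vision tower's attention; a pattern scoped to the language model may be
    # preferable for a first config.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only adapter weights should be trainable
```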