[Feature] Add support for `Pixtral` vision-language model
Feature request
https://mistral.ai/news/pixtral-12b/
Motivation / references
- https://mistral.ai/news/pixtral-12b/ - a capable vision-language model from Mistral AI (released Sep 17, 2024)
- https://huggingface.co/mistralai/Pixtral-12B-Base-2409
- https://huggingface.co/mistralai/Pixtral-12B-2409
- https://huggingface.co/mistralai/Pixtral-Large-Instruct-2411 (124B)
Your contribution
If somebody volunteers to start this work, I can answer questions and help with testing.
Towards OPE-1030
This issue sounds interesting. Could you provide me with more details? Thank you!
Hi @scopophobic, sorry for the late response! This involves adding configs to Oumi to be able to train, infer, and eval the target model. See our guide here: https://docs.google.com/document/d/1ZzDt3nd4sLEfYEooJsAPviecbgvshsW540aR3oRTzqQ/edit?usp=sharing
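As a starting point, a rough smoke test of the model outside Oumi could look like the sketch below. It assumes a recent `transformers` release where Pixtral loads through the Llava classes and the transformers-format community checkpoint `mistral-community/pixtral-12b` (the `mistralai/*` repos ship weights in Mistral's own format); the image URL and prompt are just placeholders:

```python
# Rough smoke test: load Pixtral and run a single image+text generation.
# Assumptions: a recent `transformers` release where Pixtral is served through
# the Llava classes, and the transformers-format checkpoint
# "mistral-community/pixtral-12b". Adjust the model id, image URL, and prompt
# as needed.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "mistral-community/pixtral-12b"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build a chat-style prompt with one image; the processor's chat template
# expands the image placeholder into the tokens expected by the vision tower.
image_url = "https://picsum.photos/seed/pixtral/512/384"  # placeholder image
image = Image.open(requests.get(image_url, stream=True).raw)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```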
Hey @wizeng23, just wanted to share a quick update. I've set up the initial LoRA training config for Pixtral-12B and got the training loop running end-to-end on a small dummy text-only dataset. Everything loads and runs without crashing so far, so that's a win 😅

Next, I'm trying to better understand Pixtral's vision capabilities, especially how to handle image-text inputs during training. Are there any datasets you'd recommend for fine-tuning or experimenting with VLMs like Pixtral? Even a public benchmark you've tested would be super helpful.

Also, I've done all this locally, but I'll need to move training to a cloud setup (probably GCP), since my Mac doesn't actually have the hardware to train at scale. For now I'm just focusing on getting the setup solid and understanding the data path.
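For reference, the adapter setup I'm experimenting with is roughly the sketch below. It uses PEFT directly rather than the eventual Oumi config, assumes the transformers-format checkpoint `mistral-community/pixtral-12b`, and the target modules and hyperparameters are placeholders rather than tuned values:

```python
# Illustrative only: attach LoRA adapters to Pixtral's attention projections
# via PEFT. This is a stand-in for what the eventual Oumi LoRA config should
# express; rank, alpha, and dropout are guesses, not tuned values.
import torch
from peft import LoraConfig, get_peft_model
from transformers import LlavaForConditionalGeneration

model_id = "mistral-community/pixtral-12b"  # transformers-format checkpoint (assumption)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # PEFT matches these names anywhere in the model, so this also wraps the
    # vision tower's attention; a pattern scoped to the language model may be
    # preferable for a first config.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only adapter weights should be trainable
```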