Does MLX-VLM support fine-tuning of vision tower layers?
Hi @Blaizzy , thank you for your amazing contribution to the MLX community.
Quick question please: does MLX-VLM support fine-tuning of the vision layers? From what I see in this piece of code, LoRA linear layers appear to be applied exclusively to the language-model weight matrices. I'm curious why vision LoRA layers were not incorporated for the projection and FFN modules.
Also, please correct me if I'm wrong: if I need to fine-tune the vision layers as well, can I use the same logic I see here, adding model.vision_tower to the list of modules for LoRA fine-tuning?
Once again, many thanks for your amazing work!
Hey @sachinraja13
My pleasure, it means a lot!
Not yet, but we will add it alongside the projector. Right now, @Goekdeniz-Guelmez is actively working on an overhaul of the trainer 🔥.
However, 99% of the time you don't need to fine-tune the vision tower of a VLM; there is a vast amount of research supporting this. The exceptions are training a VLM from scratch or on out-of-domain data.
Yes, you could do that.
You would then be training a vision adapter that you would need to load afterwards.
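To make the idea concrete, here is a minimal, framework-agnostic sketch of what a LoRA-wrapped linear layer looks like. This is a hypothetical illustration in NumPy, not MLX-VLM's actual implementation: the class name, rank, and alpha defaults are assumptions, and in practice you would use the library's own LoRA layer against modules like `model.vision_tower`.

```python
import numpy as np

class LoRALinear:
    """Hypothetical LoRA wrapper: freezes the base weight W and
    learns a low-rank update scale * (B @ A) on top of it."""

    def __init__(self, linear_weight, rank=8, alpha=16.0):
        out_dim, in_dim = linear_weight.shape
        self.W = linear_weight                          # frozen base weight
        self.A = np.random.randn(rank, in_dim) * 0.01   # trainable, small init
        self.B = np.zeros((out_dim, rank))              # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # Base projection plus the low-rank correction.
        # Because B starts at zero, the layer initially matches
        # the frozen linear layer exactly.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

Since `B` is zero-initialized, swapping such a wrapper into the vision tower leaves the model's outputs unchanged at the start of training; only the small `A`/`B` matrices receive gradients.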
This is very helpful. Many thanks @Blaizzy ! Eagerly looking forward to the overhauled trainer.
My pleasure :)
It won't be long now
@sachinraja13 You can already fine-tune it using #261. However, training the vision towers only works when you're training the full weights; LoRA support for the vision towers will come too. But as @Blaizzy said, you usually only train the text part, except when it's foundation training.
@Goekdeniz-Guelmez Thank you so much for your contribution.
Quick question: to optimize memory utilization, will quantized full-weight fine-tuning be supported?
Yess definitely!!!