Does MLX-VLM support fine-tuning of vision tower layers?
Hi @Blaizzy , thank you for your amazing contribution to the MLX community.
Quick question please: does MLX-VLM support fine-tuning of the vision layers? From what I see in this piece of code, LoRA linear layers appear to be applied exclusively to the language-model weight matrices. I'm curious why vision LoRA layers were not incorporated for the projection and FFN modules.
Also, please correct me if I'm wrong: if I need to fine-tune the vision layers as well, can I use the same logic I see here, adding model.vision_tower to the list of modules for LoRA fine-tuning?
Once again, many thanks for your amazing work!
Hey @sachinraja13
My pleasure, it means a lot!
Not yet, but we will add it alongside the projector. Right now, @Goekdeniz-Guelmez is actively working on an overhaul of the trainer 🔥.
However, 99% of the time you don't need to fine-tune the vision tower of a VLM; there is a vast amount of research supporting this. The exceptions are training a VLM from scratch or on out-of-domain data.
Yes, you could do that.
You would then be training a vision adapter that you would need to load afterwards.
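To make the idea concrete, here is a minimal, framework-agnostic sketch of what a LoRA-wrapped linear layer looks like. This is a hypothetical illustration in NumPy, not MLX-VLM's actual implementation: the class name, rank, and alpha defaults are assumptions, and in practice you would use the library's own LoRA layer against modules like `model.vision_tower`.

```python
import numpy as np

class LoRALinear:
    """Hypothetical LoRA wrapper: freezes the base weight W and
    learns a low-rank update scale * (B @ A) on top of it."""

    def __init__(self, linear_weight, rank=8, alpha=16.0):
        out_dim, in_dim = linear_weight.shape
        self.W = linear_weight                          # frozen base weight
        self.A = np.random.randn(rank, in_dim) * 0.01   # trainable, small init
        self.B = np.zeros((out_dim, rank))              # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # Base projection plus the low-rank correction.
        # Because B starts at zero, the layer initially matches
        # the frozen linear layer exactly.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

Since `B` is zero-initialized, swapping such a wrapper into the vision tower leaves the model's outputs unchanged at the start of training; only the small `A`/`B` matrices receive gradients.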
This is very helpful. Many thanks @Blaizzy ! Eagerly looking forward to the overhauled trainer.
My pleasure :)
It won't be long now
@sachinraja13 You can already fine-tune it using #261. However, training the vision towers only works when you're training the full weights; LoRA support for the vision towers will come too. But as @Blaizzy said, you usually only train the text part, except when it's foundation training.
@Goekdeniz-Guelmez Thank you so much for your contribution.
Quick question: to optimize memory utilization, will quantized full-weight fine-tuning be supported?
Yess definitely!!!