Finetuning multimodal vision models? (LLaVA and BakLLaVA)
Hey unsloth team, beautiful work being done here.
I am the author of MachinaScript for Robots - a framework for building LLM-powered robots in your garage!
The LLM basically outputs a JSON-like set of instructions for actions, movements, and skill usages, which are then parsed by a Raspberry Pi and serialized to an Arduino for execution. I am using unsloth to train a model that outputs this syntax so we can have smaller system prompts and faster execution for the robot.
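For readers who haven't seen the repo, here is a minimal sketch of what that Pi-side handoff could look like. The JSON field names, serial port, and baud rate are invented placeholders (not the actual MachinaScript schema), and it assumes pyserial is available on the Raspberry Pi.

```python
# Hypothetical sketch only -- the real MachinaScript schema lives in the
# MachinaScript repo; the field names below are made up for illustration.
import json
import serial  # pyserial, assumed to be installed on the Raspberry Pi

# Example of the kind of JSON-like action plan an LLM might emit
llm_output = """
{
  "actions": [
    {"motor": "arm_base", "angle": 90, "speed": "medium"},
    {"motor": "gripper", "angle": 10, "speed": "slow"}
  ]
}
"""

def forward_to_arduino(plan_json: str, port: str = "/dev/ttyUSB0") -> None:
    """Parse the LLM's plan on the Pi and serialize each action to the Arduino."""
    plan = json.loads(plan_json)
    with serial.Serial(port, baudrate=9600, timeout=1) as link:
        for action in plan["actions"]:
            # One compact line per action; the Arduino sketch would parse this.
            link.write((json.dumps(action) + "\n").encode("utf-8"))

forward_to_arduino(llm_output)
```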
However, these models only receive text instructions; with no vision involved, it is difficult to build a fully self-operating robot out of them.
The project was initially based on GPT-4V, but with the great multimodal open models out there, like Obsidian, LLaVA, and BakLLaVA, the world of LLM-powered robots is ready to take a great leap forward. I would love to plan a dataset and finetune a vision model to output MachinaScript syntax using the awesome capabilities of unsloth. Is it possible to finetune multimodal LLMs, or will it be possible in the future?
@babycommando Hey, thanks for the cool request! Super cool repo as well! And it's super interesting that you're finetuning a model to output instructions and then actually executing them!
Hmm, currently vision models are more complex. Technically, in the LLaVA paper a vision encoder is used first, and its output features are projected into the LLM's embedding space before the LLM processes them.
So in theory the LLM part can be optimized with Unsloth, and the rest can be optimized at a later date. I just haven't had time to work on vision + LLM type models, but we will do so at a later date :)
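To make that split concrete, here is a toy sketch of the connector idea, assuming a CLIP-style vision encoder whose patch features get projected into the LLM's embedding space; the shapes and module names are illustrative, not the actual LLaVA code.

```python
import torch
import torch.nn as nn

class ToyLlavaStyleConnector(nn.Module):
    """Illustrative only: project vision-encoder features into the LLM's
    token-embedding space so they can be concatenated with text embeddings."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # LLaVA-1.5 uses a small MLP here; a single Linear keeps the sketch short.
        self.projector = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the vision encoder
        # text_embeds:    (batch, seq_len, llm_dim) from the LLM's embedding table
        visual_tokens = self.projector(image_features)
        # The LLM (the part Unsloth could optimize) then attends over both.
        return torch.cat([visual_tokens, text_embeds], dim=1)

connector = ToyLlavaStyleConnector()
fused = connector(torch.randn(1, 576, 1024), torch.randn(1, 32, 4096))
print(fused.shape)  # torch.Size([1, 608, 4096])
```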
Again super cool project!
Hey Daniel, sorry for the delay! I did some deep research into finetuning multimodal models, and it turns out the LLaVA repo already provides most of what we need to get started.
It would be so cool if we could borrow Unsloth's awesome capabilities to execute it.
This is an official doc for finetuning LLaVA: https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md
It mentions that:
- if you have a lot of data, use this script (12 hours on 8×A100s): https://github.com/haotian-liu/LLaVA/blob/main/scripts/v1_5/finetune_task_lora.sh
- if you don't have a lot of data, use this script (a few hours on a single A100): https://github.com/haotian-liu/LLaVA/blob/main/scripts/v1_5/finetune_task.sh

(try to also take a look at the whole /scripts directory)
Also, the dataset format is the ShareGPT format mentioned in the doc. For anyone else wondering how the finetuning dataset should be formatted, this is the dataset they used to make LLaVA: https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K
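To make the expected layout concrete, here is roughly what a single record in that ShareGPT-style format looks like; the id, image path, and conversation text below are invented examples (the real dataset is a JSON list of such records):

```python
import json

# One illustrative record in the ShareGPT-style layout used by LLaVA-Instruct-150K.
# The id, image path, and conversation contents are made up for this example.
record = {
    "id": "000000001",
    "image": "images/000000001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat should the robot do with the object on the table?"},
        {"from": "gpt", "value": '{"actions": [{"motor": "gripper", "angle": 10, "speed": "slow"}]}'},
    ],
}

# The custom-data finetuning doc expects a JSON file containing a list of records.
with open("machinascript_vision_train.json", "w") as f:
    json.dump([record], f, indent=2)
```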
About Obsidian 3B: after talking with one of the core engineers, it is clear that it is a version of LLaVA using a strain of Zephyr 3B underneath. It is made by the same people who made Hermes (Nous Research).
- They recommend this script for finetuning with a big dataset: https://github.com/NousResearch/Obsidian/blob/main/scripts/finetune.sh
- And this one for smaller datasets: https://github.com/NousResearch/Obsidian/blob/main/scripts/finetune_qlora.sh
So, I hope this sheds some light on the integration of multimodal training. They both seem to be using DeepSpeed; I haven't tried it myself yet. Would love to use Unsloth for this!
And again, thank you so much for the interest in MachinaScript. Free the robots!!!
@babycommando Thanks for the writeup! Super useful and wonderful insights! :) I will check all of these out in the coming days! :)) Hopefully Unsloth will have support for LLaVA-type models in the near future :))
Can't wait to see it implemented. Thanks.
I am currently exploring the qnguyen3/nanoLLaVA model, which is built on top of Quyen-SE-v0.1 (Qwen1.5-0.5B) and incorporates Google SigLIP-400M.
Would there be support for Colab or Kaggle fine-tuning of qnguyen3/nanoLLaVA?
Thank you for making the unsloth project open-source. I am eagerly looking forward to seeing its implementation.
Here are the links to the nanoLLaVA project:
https://huggingface.co/qnguyen3/nanoLLaVA
https://github.com/qnguyen3/nanoLLaVA
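In case it helps anyone experimenting before official support lands, nanoLLaVA appears to load through plain transformers with remote code enabled; a rough sketch follows, with the exact usage best double-checked against the qnguyen3/nanoLLaVA model card.

```python
# Rough sketch, assuming nanoLLaVA loads via transformers' remote-code path
# (check the qnguyen3/nanoLLaVA model card for the exact, up-to-date usage).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "qnguyen3/nanoLLaVA"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,  # custom Qwen1.5-0.5B + SigLIP-400M wiring ships with the repo
)
```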
Hmm, LLaVA is probably for a future release.
Big +1 for any fairly recent vision LLM. Ideally one of the smaller ones, like nanoLLaVA.
Hugging Face now supports LLaVA, LLaVA-NeXT, and LLaVA-NeXT-Video (LLaVA-NeXT is the improved version of LLaVA) and has multiple tutorials with PyTorch Lightning (which can be converted to the HF Trainer) as well as with the HF Trainer (for the video version).
- LLaVa: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LLaVa
- LLaVa-NeXT: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LLaVa-NeXT
- LLaVA-NeXT-Video: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LLaVA-NeXT-Video
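For orientation, the transformers classes for LLaVA-NeXT are native (no remote code needed); here is a minimal inference sketch, where the checkpoint name and the [INST] ... [/INST] prompt template are taken from the llava-hf model cards and should be double-checked against the tutorials above.

```python
# Minimal LLaVA-NeXT sketch using the native transformers classes.
# The checkpoint name and chat template follow the llava-hf model cards;
# verify them against the linked tutorials before relying on this.
import requests
import torch
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
)
prompt = "[INST] <image>\nDescribe what the robot should pick up. [/INST]"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```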
Hope this helps!
Thank you! Is support for Phi-3.5 Vision likely? (Sorry, the multimodal world moves fast!)
Looking at the tutorials and the multimodal models section of the transformers library docs, I don't think they have support for Phi-3.5 yet.