what is the difference between "NVILA-Lite", "NVILA" and "NVILA-video"?

Open MengHao666 opened this issue 10 months ago • 13 comments

I am very confused about the models here.

https://huggingface.co/collections/Efficient-Large-Model/nvila-674f8163543890b35a91b428

MengHao666 avatar Jan 26 '25 05:01 MengHao666

Hi, please refer to #167 for details and we will update this in our next version of the paper.

bfshi avatar Jan 28 '25 05:01 bfshi

Would you provide instructions on how to fine-tune the model on custom data?

MengHao666 avatar Jan 28 '25 08:01 MengHao666

For sure.

All the training scripts are listed here: https://github.com/NVlabs/VILA#training

All you need to do before running these scripts is prepare the data. Custom data (using single-image QA data as an example here) should be formatted into a JSON file that looks like this:

[
    {
        "id": "1",
        "image": <relative_path_to_image_under_its_root_folder>,
        "conversations": [
            {
                "from": "human",
                "value": "What can you see in the image"
            },
            {
                "from": "gpt",
                "value": "In the center of the image, I can see..."
            }
        ]
    },
    ...
]
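A file in this format could be generated with a short script. The following is only a minimal sketch, not part of the VILA repository: the helper names `build_records` and `write_dataset` and the (image, question, answer) input triples are assumptions made for illustration.

```python
import json
from pathlib import Path


def build_records(samples):
    """Turn (image_path, question, answer) triples into the conversation format above.

    The triple layout is a hypothetical input format chosen for this sketch.
    """
    records = []
    for idx, (image, question, answer) in enumerate(samples, start=1):
        records.append(
            {
                "id": str(idx),
                "image": image,  # path relative to the image root folder
                "conversations": [
                    {"from": "human", "value": question},
                    {"from": "gpt", "value": answer},
                ],
            }
        )
    return records


def write_dataset(samples, out_path):
    """Serialize the records to a JSON file and return them."""
    records = build_records(samples)
    text = json.dumps(records, indent=4, ensure_ascii=False)
    Path(out_path).write_text(text, encoding="utf-8")
    return records
```

Calling `write_dataset([("cats/001.jpg", "What can you see in the image", "In the center of the image, I can see...")], "my_dataset.json")` would produce a one-entry file with the same shape as the example.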

Once you have the json file, you can add this dataset into llava/data/registry/datasets/default.yaml by adding an entry in the file that looks like:

<dataset_name>:
    _target_: llava.data.LLaVADataset
    data_path: <path_to_json_file>
    media_dir: <image_root_folder_path>

Now you are ready to go! Simply run the training script, passing the datasets you would like to train on as an argument. Dataset names are concatenated with +. For example, if you train on three datasets, the command looks like this:

bash scripts/NVILA-Lite/align.sh Efficient-Large-Model/Qwen2-VL-7B-Instruct <dataset_name1>+<dataset_name2>+<dataset_name3>
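If the dataset list is assembled programmatically, the "+"-joined argument can be built as below. This is a sketch only: the function `build_align_command` and the dataset names `my_vqa`, `my_caption`, and `my_ocr` are hypothetical, and the script path and base model are simply copied from the command above.

```python
import shlex


def build_align_command(base_model, dataset_names, script="scripts/NVILA-Lite/align.sh"):
    """Return the argv list for the training script with datasets joined by '+'."""
    if not dataset_names:
        raise ValueError("at least one dataset name is required")
    for name in dataset_names:
        # '+' is the separator, so it cannot appear inside a single name.
        if "+" in name:
            raise ValueError(f"dataset name must not contain '+': {name!r}")
    mixture = "+".join(dataset_names)
    return ["bash", script, base_model, mixture]


cmd = build_align_command(
    "Efficient-Large-Model/Qwen2-VL-7B-Instruct",
    ["my_vqa", "my_caption", "my_ocr"],  # hypothetical registry names
)
print(shlex.join(cmd))  # shell-quoted form of the command
```

The returned list can also be passed directly to `subprocess.run(cmd, check=True)` instead of being printed.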

And that's it!

bfshi avatar Feb 04 '25 04:02 bfshi

Thanks, I will try to fine-tune it on custom data and compare the results with other open-source models.

MengHao666 avatar Feb 04 '25 10:02 MengHao666

Sounds good. I would love to see what the results look like, and please feel free to let me know if there are any other questions.

bfshi avatar Feb 04 '25 21:02 bfshi

Hi @MengHao666, we've posted more detailed instructions on how to train NVILA with custom data: https://github.com/NVlabs/VILA/blob/main/finetuning/README.md

In case it's helpful.

bfshi avatar Feb 11 '25 01:02 bfshi

Wonderful! It will increase the impact of the NVILA model series. I will give it a try.

MengHao666 avatar Feb 11 '25 01:02 MengHao666

In my fine-tuning runs, I needed to change the code around this area from "config._name_or_path" to "config.model_name_or_path"; otherwise, the code would fail.
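One way to avoid editing the attribute name by hand is to try both names. The snippet below is only a sketch of that idea, not the repository's code: `resolve_model_path` is a hypothetical helper, the lookup order is an assumption, and `SimpleNamespace` stands in for the real config object.

```python
from types import SimpleNamespace


def resolve_model_path(config):
    """Return the first non-empty model path found on the config object.

    Tries "model_name_or_path" first and then "_name_or_path"; this order
    is an assumption for illustration.
    """
    for attr in ("model_name_or_path", "_name_or_path"):
        value = getattr(config, attr, None)
        if value:
            return value
    raise AttributeError("config has neither model_name_or_path nor _name_or_path set")


# Stand-in config objects for demonstration only.
cfg_a = SimpleNamespace(model_name_or_path="Efficient-Large-Model/NVILA-8B-Video")
cfg_b = SimpleNamespace(_name_or_path="Efficient-Large-Model/NVILA-8B-Video")
print(resolve_model_path(cfg_a))
print(resolve_model_path(cfg_b))
```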

MengHao666 avatar Feb 28 '25 09:02 MengHao666

Hi @MengHao666, which NVILA model are you finetuning on?

bfshi avatar Feb 28 '25 20:02 bfshi

Efficient-Large-Model/NVILA-8B-Video

MengHao666 avatar Mar 03 '25 02:03 MengHao666

I suggest that this model support reading video at a specified fps in the future, so that it can comprehend temporal dynamics. Compared with Qwen2.5-VL, the NVILA model series does not consider time information in training. This may mean that the model cannot handle time-related tasks well.
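To make the suggestion concrete, here is a generic sketch of fps-based frame sampling that keeps a timestamp for every sampled frame. It does not describe how NVILA or Qwen2.5-VL actually load video; the function `sample_frame_indices` and its parameters are assumptions for illustration.

```python
def sample_frame_indices(num_frames, native_fps, target_fps, max_frames=None):
    """Pick frame indices at a fixed target fps and keep each frame's timestamp.

    Returns a list of (frame_index, time_in_seconds) pairs.
    """
    if num_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        raise ValueError("num_frames, native_fps and target_fps must be positive")
    duration = num_frames / native_fps  # clip length in seconds
    step = 1.0 / target_fps  # seconds between sampled frames
    samples = []
    k = 0
    while k * step < duration:
        t = k * step
        idx = min(int(round(t * native_fps)), num_frames - 1)
        samples.append((idx, t))
        k += 1
    if max_frames is not None and len(samples) > max_frames:
        # Too many frames: thin them out evenly but keep their true timestamps.
        stride = len(samples) / max_frames
        samples = [samples[int(i * stride)] for i in range(max_frames)]
    return samples


# A 3-second clip recorded at 30 fps, sampled at 1 fps.
print(sample_frame_indices(90, 30.0, 1.0))
```

Because each sampled frame carries its time in seconds, the timestamps could be passed to the model alongside the frames, which a fixed frame count without timing does not allow.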

MengHao666 avatar Mar 10 '25 06:03 MengHao666

Can you provide some detailed examples of tasks that require comprehending time dynamics? We are training new models and this topic is on our development plan.

Lyken17 avatar Mar 10 '25 11:03 Lyken17

Hi, may I ask where I can get the LoRA scripts for fine-tuning? In particular, I'm keen on applying PEFT to the VE and the LLM separately, if possible. Thanks!

adrielkuek avatar May 20 '25 14:05 adrielkuek