VILA
What is the difference between "NVILA-Lite", "NVILA" and "NVILA-video"?
I am very confused about the models here.
https://huggingface.co/collections/Efficient-Large-Model/nvila-674f8163543890b35a91b428
Hi, please refer to #167 for details and we will update this in our next version of the paper.
Would you provide instructions on how to finetune the model on custom data?
For sure.
All the training scripts are listed here: https://github.com/NVlabs/VILA#training
All you need to do before running these scripts is to prepare the data. For custom data (using single-image QA data as an example here), it should be formatted into a JSON file that looks like:
[
    {
        "id": "1",
        "image": "<relative_path_to_image_under_its_root_folder>",
        "conversations": [
            {
                "from": "human",
                "value": "What can you see in the image?"
            },
            {
                "from": "gpt",
                "value": "In the center of the image, I can see..."
            }
        ]
    },
    ...
]
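If it helps, here is a minimal sketch (not from the official repo) that converts simple (image, question, answer) records into this JSON layout; the input records and the output filename are just assumptions for illustration:

import json

# Hypothetical input records: (relative image path, question, answer).
records = [
    ("images/0001.jpg", "What can you see in the image?", "In the center of the image, I can see..."),
]

dataset = []
for idx, (image_path, question, answer) in enumerate(records, start=1):
    dataset.append({
        "id": str(idx),
        "image": image_path,  # path relative to the image root folder (media_dir)
        "conversations": [
            {"from": "human", "value": question},
            {"from": "gpt", "value": answer},
        ],
    })

# Write the file that will later be referenced as data_path in the dataset registry.
with open("custom_sft.json", "w") as f:
    json.dump(dataset, f, indent=2)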
Once you have the JSON file, you can register this dataset in llava/data/registry/datasets/default.yaml by adding an entry that looks like:
<dataset_name>:
    _target_: llava.data.LLaVADataset
    data_path: <path_to_json_file>
    media_dir: <image_root_folder_path>
Now you are ready to go! Simply run the training script, indicating the dataset(s) you would like to train on. Dataset names are concatenated with +. For example, if you train on three datasets, the command looks like this:
bash scripts/NVILA-Lite/align.sh Efficient-Large-Model/Qwen2-VL-7B-Instruct <dataset_name1>+<dataset_name2>+<dataset_name3>
And that's it!
Thanks, I will try to finetune it on custom data and compare the results with other open-source models.
Sounds good, would love to see how the results turn out, and please feel free to let me know if there are any other questions.
Hi @MengHao666, we've posted more detailed instructions on how to train NVILA with custom data: https://github.com/NVlabs/VILA/blob/main/finetuning/README.md
In case it's helpful.
Wonderful! It will increase the impact of the NVILA model series. I will give it a try.
In my finetuning, I needed to modify the code in this area from "config._name_or_path" to "config.model_name_or_path"; otherwise, the code would fail.
Hi @MengHao666, which NVILA model are you finetuning on?
Efficient-Large-Model/NVILA-8B-Video
I suggest that this model support reading video at a configurable FPS in the future, so it can comprehend temporal dynamics. Compared with Qwen2.5-VL, the NVILA model series does not take time information into account during training. This may limit the model's ability to handle time-related tasks.
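To illustrate what I mean, here is a rough sketch of FPS-based frame sampling (assuming OpenCV is installed; the function name sample_frames_at_fps is made up for illustration). The returned timestamps could then be exposed to the model during training so it sees explicit temporal positions:

import cv2

def sample_frames_at_fps(video_path, target_fps=1.0):
    """Sample frames at a fixed rate and return (timestamp_seconds, frame) pairs."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(native_fps / target_fps)), 1)  # keep every `step`-th frame

    frames = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append((index / native_fps, frame))  # timestamp in seconds
        index += 1
    cap.release()
    return frames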
Can you provide some detailed examples of comprehending time dynamics? We are training new models and this topic is on our development plan.
Hi, can I check where I could get the LoRA scripts for fine-tuning? In particular, I'm keen on applying PEFT to the vision encoder (VE) and the LLM separately, if possible. Thanks!
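For reference, something along these lines is what I have in mind (a sketch using the HuggingFace peft library; the target module names and the model attributes are guesses and may not match VILA's actual layer names):

from peft import LoraConfig, get_peft_model

def add_lora(module, target_modules):
    """Wrap a sub-module (e.g. the LLM or the vision encoder) with LoRA adapters."""
    config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=target_modules,  # layer names are guesses; adjust to the real model
    )
    return get_peft_model(module, config)

# Hypothetical usage, assuming `model` exposes an LLM and a vision tower:
# model.llm = add_lora(model.llm, ["q_proj", "k_proj", "v_proj", "o_proj"])
# model.vision_tower = add_lora(model.vision_tower, ["qkv", "proj"])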