VILA
What is the difference between "NVILA-Lite", "NVILA" and "NVILA-video"?
I am very confused about the models here.
https://huggingface.co/collections/Efficient-Large-Model/nvila-674f8163543890b35a91b428
Hi, please refer to #167 for details and we will update this in our next version of the paper.
Would you provide instructions on how to finetune the model on custom data?
For sure.
All the training scripts are listed here: https://github.com/NVlabs/VILA#training
All you need to do before running these scripts is to prepare the data. For custom data (using single-image QA data as an example here), it should be formatted into a JSON file that looks like:
[
    {
        "id": "1",
        "image": "<relative_path_to_image_under_its_root_folder>",
        "conversations": [
            {
                "from": "human",
                "value": "What can you see in the image?"
            },
            {
                "from": "gpt",
                "value": "In the center of the image, I can see..."
            }
        ]
    },
    ...
]
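If it helps, here is a minimal sketch (not from the official repo) that converts simple (image, question, answer) records into this JSON layout; the input records and the output filename are just assumptions for illustration:

import json

# Hypothetical input records: (relative image path, question, answer).
records = [
    ("images/0001.jpg", "What can you see in the image?", "In the center of the image, I can see..."),
]

dataset = []
for idx, (image_path, question, answer) in enumerate(records, start=1):
    dataset.append({
        "id": str(idx),
        "image": image_path,  # path relative to the image root folder (media_dir)
        "conversations": [
            {"from": "human", "value": question},
            {"from": "gpt", "value": answer},
        ],
    })

# Write the file that will later be referenced as data_path in the dataset registry.
with open("custom_sft.json", "w") as f:
    json.dump(dataset, f, indent=2)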
Once you have the JSON file, you can register this dataset in llava/data/registry/datasets/default.yaml by adding an entry that looks like:
<dataset_name>:
    _target_: llava.data.LLaVADataset
    data_path: <path_to_json_file>
    media_dir: <image_root_folder_path>
Now you are ready to go! Simply run the training script, indicating the dataset(s) you would like to train on. Dataset names are concatenated with +. For example, if you train on three datasets, the command looks like this:
bash scripts/NVILA-Lite/align.sh Efficient-Large-Model/Qwen2-VL-7B-Instruct <dataset_name1>+<dataset_name2>+<dataset_name3>
And that's it!
Thanks, I will try to finetune it on custom data and compare the results with other open-source models.
Sounds good, would love to see how the results turn out, and please feel free to let me know if there are any other questions.
Hi @MengHao666, we've posted more detailed instructions on how to train NVILA with custom data: https://github.com/NVlabs/VILA/blob/main/finetuning/README.md
In case it's helpful.
Wonderful! It will increase the impact of the NVILA model series. I will give it a try.
In my finetuning, I needed to modify the code in this area from "config._name_or_path" to "config.model_name_or_path"; otherwise, the code would fail.
Hi @MengHao666, which NVILA model are you finetuning on?
Efficient-Large-Model/NVILA-8B-Video
I suggest that this model support reading video at a configurable FPS in the future, so it can comprehend temporal dynamics. Compared with Qwen2.5-VL, the NVILA model series does not take time information into account during training. This may limit the model's ability to handle time-related tasks.
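To illustrate what I mean, here is a rough sketch of FPS-based frame sampling (assuming OpenCV is installed; the function name sample_frames_at_fps is made up for illustration). The returned timestamps could then be exposed to the model during training so it sees explicit temporal positions:

import cv2

def sample_frames_at_fps(video_path, target_fps=1.0):
    """Sample frames at a fixed rate and return (timestamp_seconds, frame) pairs."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(native_fps / target_fps)), 1)  # keep every `step`-th frame

    frames = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append((index / native_fps, frame))  # timestamp in seconds
        index += 1
    cap.release()
    return frames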
Can you provide some detailed examples of comprehending time dynamics? We are training new models and this topic is on our development plan.
Hi, can I check where I could get the LoRA scripts for fine-tuning? In particular, I'm keen on applying PEFT to the vision encoder (VE) and the LLM separately, if possible. Thanks!
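For reference, something along these lines is what I have in mind (a sketch using the HuggingFace peft library; the target module names and the model attributes are guesses and may not match VILA's actual layer names):

from peft import LoraConfig, get_peft_model

def add_lora(module, target_modules):
    """Wrap a sub-module (e.g. the LLM or the vision encoder) with LoRA adapters."""
    config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=target_modules,  # layer names are guesses; adjust to the real model
    )
    return get_peft_model(module, config)

# Hypothetical usage, assuming `model` exposes an LLM and a vision tower:
# model.llm = add_lora(model.llm, ["q_proj", "k_proj", "v_proj", "o_proj"])
# model.vision_tower = add_lora(model.vision_tower, ["qkv", "proj"])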