VILA
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
Hi, when you do Sequence Parallel, you are padding with token id 2 = '#': https://github.com/NVlabs/VILA/blob/2b43308f25e63161a172fe9a38e3a04e2fcd12ef/llava/data/dataset.py#L1372-L1389 Could you let me know why you are padding with this instead of...
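For readers unfamiliar with the collator, here is a minimal sketch of the kind of padding sequence parallelism needs. The function name, the `sp_degree` handling, and the masking details are assumptions for illustration, not VILA's actual code; only the pad value of 2 mirrors the issue.

```python
import torch

def pad_for_sequence_parallel(input_ids, sp_degree, pad_token_id=2):
    """Illustrative sketch: right-pad a token sequence so its length is
    divisible by the sequence-parallel degree, allowing an even split
    across ranks. In a real collator the padded positions would also be
    masked out of attention and loss."""
    seq_len = input_ids.size(-1)
    remainder = seq_len % sp_degree
    if remainder == 0:
        return input_ids
    pad_len = sp_degree - remainder
    padding = torch.full(
        (*input_ids.shape[:-1], pad_len), pad_token_id, dtype=input_ids.dtype
    )
    return torch.cat([input_ids, padding], dim=-1)
```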
The current DataCollatorForSupervisedDatasetSeqParallel in llava/data/dataset.py is built for image datasets. There will be many errors when it is used directly for video datasets. Will you release a similar solution for video...
I want to train a multimodal video understanding model. What should I do? I see that the NVILA-15B model supports video inference.
Hello, Author. When I changed the question in the "searching for a needle in the haystack" evaluation from one about the needle to a different question (for example, "please describe...
When I evaluated NVILA-8B-Video on lmms-longvideobench with this script:

```bash
#!/bin/bash
set -e

MODEL_NAMES=(
    "NVILA-8B-Video"
)
SELECTED_TASKS=(
    "lmms-longvideobench_val_v"
)

# Join the selected tasks with commas; IFS must be set in the same
# subshell as the expansion for the join to take effect.
TASK_STR=$(IFS=,; echo "${SELECTED_TASKS[*]}")
echo "TASK_STR: $TASK_STR"

START_TIME=$(date +%s)
echo...
```
Replacing `+=` with `text_embeds = text_embeds + (...)` avoids the "leaf Variable that requires grad is being used in an in-place operation" RuntimeError in PyTorch.
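A minimal reproduction of the error and the out-of-place fix (the tensor shapes are illustrative only):

```python
import torch

# A leaf tensor that requires grad, e.g. trainable text embeddings.
text_embeds = torch.zeros(4, 8, requires_grad=True)
delta = torch.randn(4, 8)

# An in-place update on a leaf raises:
# RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
# text_embeds += delta

# The out-of-place add builds a new (non-leaf) tensor, so autograd can
# still track the original leaf through the addition.
text_embeds = text_embeds + delta
```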
Hello, I have a question about understanding the code. In the eval_forward function, I noticed that the code concatenates answer_embeds with input_embeds and then feeds the combined...
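For context, this concatenation pattern is typical of teacher-forced scoring: the prompt and answer embeddings are fed through the model in one pass, and loss is computed only over the answer positions. The sketch below is an assumption about the general technique, not VILA's eval_forward; every name in it is hypothetical.

```python
import torch
import torch.nn.functional as F

def eval_forward_sketch(model, input_embeds, answer_embeds, answer_ids):
    """Teacher-forced scoring sketch: run prompt + answer embeddings in a
    single forward pass and score only the answer tokens."""
    combined = torch.cat([input_embeds, answer_embeds], dim=1)  # (B, T_in + T_ans, D)
    logits = model(inputs_embeds=combined).logits
    # Logits at position t predict token t + 1, so the window predicting
    # the answer starts one step before the first answer position.
    start = input_embeds.size(1) - 1
    answer_logits = logits[:, start : start + answer_ids.size(1), :]
    return F.cross_entropy(
        answer_logits.reshape(-1, answer_logits.size(-1)),
        answer_ids.reshape(-1),
    )
```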
When using LongViLa-R1 for video summarization, I encountered an issue where one video chunk took an abnormally long time to process, resulting in a large summary with significant repetition. Model:...
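A common mitigation for runaway, repetitive generations is to cap the output length and penalize repetition. The settings below follow Hugging Face's GenerationConfig; whether LongViLa-R1's inference path honors all of them is an assumption.

```python
from transformers import GenerationConfig

# Hypothetical mitigation, not a confirmed fix for this issue.
gen_config = GenerationConfig(
    max_new_tokens=512,       # hard cap so one chunk cannot run indefinitely
    repetition_penalty=1.2,   # discourage re-emitting recent tokens
    no_repeat_ngram_size=4,   # forbid verbatim 4-gram repeats
    do_sample=False,
)
```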
Hello VILA team! First, thank you for open-sourcing this incredible family of Vision Language Models! The work on VILA and NVILA is truly impressive, and the focus on efficiency and...