VILA

VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.

92 VILA issues

- Where can I find the paper for VILA-1.5?
- What vision encoders and LMs are used for VILA-1.5 3B, 8B, 13B, and 40B?

The youcook2 data repository (http://youcook2.eecs.umich.edu/download) only provides a script to download the raw videos into a folder `.../youcook2/raw_videos/`. However, the `youcook_filtered_v3.json` file has entries like ``` {...

Hi, it looks like VILA was trained on a lot of videos. How are they sampled? And how does it deal with S2?

I think that if S2 is used and the ViT is unfrozen, the results could be worse, since S2 splits the images.

The way you use it actually feeds 5 images into the ViT. How does that compare with interpolating to 768x768, which is equivalent to sending 4 images into the ViT but in a different manner?
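The "5 images" count can be made concrete with a small sketch. It assumes S2-style scales of (384, 768) with a 384x384 base resolution: the 384 copy yields 1 view and the 768 copy is cut into 2 x 2 = 4 tiles, for 5 ViT inputs in total. The helper names (`resize_nearest`, `split_into_tiles`, `s2_style_views`, `patch_tokens`) are hypothetical, nearest-neighbour resizing stands in for proper interpolation, the patch size of 16 is an assumption, and any feature merging VILA performs afterwards is not shown; this is not VILA's or s2wrapper's API.

```python
import numpy as np


def resize_nearest(img: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbour resize of an (H, W, C) array to (size, size, C).

    A simplification: a real pipeline would use bilinear/bicubic interpolation.
    """
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]


def split_into_tiles(img: np.ndarray, tile: int) -> list:
    """Split a square (S, S, C) array into (S // tile) ** 2 non-overlapping tiles."""
    n = img.shape[0] // tile
    return [
        img[i * tile:(i + 1) * tile, j * tile:(j + 1) * tile]
        for i in range(n)
        for j in range(n)
    ]


def s2_style_views(img: np.ndarray, scales=(384, 768), base=384) -> list:
    """Resize to each scale, then cut every scaled copy into base-sized tiles.

    Every view has the ViT's native resolution (base x base), so the encoder
    is unchanged; only the number of forward passes grows with the scale.
    """
    views = []
    for scale in scales:
        views.extend(split_into_tiles(resize_nearest(img, scale), base))
    return views


def patch_tokens(side: int, patch: int) -> int:
    """Number of patch tokens a ViT produces for a side x side input."""
    return (side // patch) ** 2


if __name__ == "__main__":
    image = np.random.rand(1080, 1920, 3)
    views = s2_style_views(image)
    print(len(views))  # 1 view from the 384 copy + 4 tiles from the 768 copy
    # Interpolating to 768x768 instead: one forward pass with as many tokens as 4 tiles.
    print(patch_tokens(768, 16), 4 * patch_tokens(384, 16))
```

With the assumed patch size, four 384 tiles and one interpolated 768x768 pass give the ViT the same number of patch tokens. The difference is that each tile only attends within itself, whereas a single 768x768 pass attends globally but needs interpolated positional embeddings at a resolution the encoder was not pretrained on.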

In https://github.com/Efficient-Large-Model/VILA/blob/main/llava/data/datasets_mixture.py#L171C5-L171C6 the math dataset is described as type 'vflan'. However, in `data_prepare/README.md` it isn't clear which dataset that corresponds to. I'm guessing it is GSM8K-ScRel-SFT. But the format of the...

Hello! Thanks for sharing such a nice project. I have set up the environment following the instructions in the README. When I run the inference example as follows (I have...

Excellent work! When using LoRA to train a 40B model for my task, I found while loading the model for inference that LoRA did not save the weights of the `...

When running this script:

```
python -W ignore llava/eval/run_vila.py \
    --model-path Efficient-Large-Model/VILA1.5-3b \
    --conv-mode vicuna_v1 \
    --query "\n Please describe this video." \
    --video-file "demo.mp4"
```

with disable #from tf_utils...

I get the following error while running `llava/eval/run_vila.py` on an H100 GPU: ``` root@7513903dd8b0:/src/VILA# python -W ignore llava/eval/run_vila.py --model-path Efficient-Large-Model/VILA1.5-3b --conv-mode vicuna_v1 --query "\n Please describe this video." --video-file "tjx1PPFsa6A-Scene-049.mp4"...