
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.

141 VILA issues

Hello authors, thanks for sharing this fantastic work. Could you say where this dataset came from, and share a link or the data itself? "/lustre/fsw/portfolios/nvr/projects/nvr_elm_llm/dataset/video_datasets_v2/perception_test/"

Your version of transformers forces LlamaFlashAttention2 in the constructor of LlamaDecoderLayer in transformers/models/llama/modeling_llama.py, which requires Ampere or newer to work. Simply using the old LlamaAttention class instead of LlamaFlashAttention2...

I tried to install and run the project on a machine with an NVIDIA Tesla T4 GPU, which has a compute capability of 7.5 (SM 75). Environment: Ubuntu 22.04 with...

Choose LlamaAttention instead of LlamaFlashAttention2 if flash attention is not supported by the GPU architecture. PR for #41
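
Since this comes up repeatedly on pre-Ampere GPUs (T4, V100, and similar), here is a minimal sketch of the fallback this PR describes. It assumes the patched modeling_llama.py exposes both attention classes and that the patch runs before any model is constructed; the SM 80 threshold is FlashAttention-2's documented hardware requirement.

```python
# Sketch of the workaround: on pre-Ampere GPUs (compute capability < 8.0),
# alias LlamaFlashAttention2 to the standard LlamaAttention so the patched
# LlamaDecoderLayer constructor falls back to eager attention.
# Apply this before any model is built.
import torch
import transformers.models.llama.modeling_llama as modeling_llama

major, _minor = torch.cuda.get_device_capability()
if major < 8:  # FlashAttention-2 requires Ampere (SM 80) or newer
    modeling_llama.LlamaFlashAttention2 = modeling_llama.LlamaAttention
```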

When running the code on Jetson, I found that the tokenizer (tokenizer = AutoTokenizer.from_pretrained(args.model_path, use_fast=False)) cannot correctly convert the LLAVA_DEFAULT_IMAGE_PATCH_TOKEN into its index in the vocabulary. That is, the...
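
For anyone hitting the same symptom, a hedged sketch of checking whether the patch token is registered, and registering it if missing, is below. The literal "<im_patch>" is an assumption borrowed from LLaVA's DEFAULT_IMAGE_PATCH_TOKEN, and the model path is a placeholder; use the constants and path your checkpoint actually defines.

```python
# Check whether the image patch token maps to a real vocab id, and register
# it as a special token if it falls back to <unk>.
from transformers import AutoTokenizer

model_path = "Efficient-Large-Model/VILA-7B"  # placeholder; use your path
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

patch_token = "<im_patch>"  # assumption: LLaVA's DEFAULT_IMAGE_PATCH_TOKEN
token_id = tokenizer.convert_tokens_to_ids(patch_token)
if token_id is None or token_id == tokenizer.unk_token_id:
    # Not in the vocabulary; register it so it gets a dedicated id.
    # (The model's token embeddings must then be resized to match.)
    tokenizer.add_tokens([patch_token], special_tokens=True)
print(patch_token, "->", tokenizer.convert_tokens_to_ids(patch_token))
```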

Hi, thanks for the nice work! I wonder what the main modifications in `llava/train/transformers_replace` are compared to the original implementation in `transformers==4.31.0`, as specified in the pyproject.toml. Also, in environment_setup.sh,...
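
One way to answer this locally is to diff the replacement files against the installed package. A small sketch, assuming the repo layout above, that `transformers_replace` mirrors the package's directory structure, and that transformers 4.31.0 is installed in the current environment:

```python
# Diff each file in llava/train/transformers_replace against its counterpart
# in the installed transformers package to see exactly what was modified.
import difflib
import pathlib

import transformers

replace_dir = pathlib.Path("llava/train/transformers_replace")  # repo checkout
pkg_dir = pathlib.Path(transformers.__file__).parent

for patched in replace_dir.rglob("*.py"):
    original = pkg_dir / patched.relative_to(replace_dir)
    if not original.exists():
        continue
    diff = difflib.unified_diff(
        original.read_text().splitlines(),
        patched.read_text().splitlines(),
        fromfile=str(original),
        tofile=str(patched),
        lineterm="",
    )
    for line in diff:
        print(line)
```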

Thank you for your fantastic work. Is there any plan to release the pre-trained model checkpoints before SFT for both VILA7B and VILA13B? It would be helpful to evaluate its...

I found the code for handling multi-image inputs confusing, particularly in the following sections: https://github.com/Efficient-Large-Model/VILA/blob/ef662c84fe7e34101184ceab310fc41f837084b4/llava/model/llava_arch.py#L127 The nested for loops starting at https://github.com/Efficient-Large-Model/VILA/blob/ef662c84fe7e34101184ceab310fc41f837084b4/llava/model/llava_arch.py#L168 and https://github.com/Efficient-Large-Model/VILA/blob/ef662c84fe7e34101184ceab310fc41f837084b4/llava/model/llava_arch.py#L198 seem to iterate over...
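
For readers trying to follow those loops, here is a minimal single-sample sketch of the general LLaVA-style merge that code performs: walk the token ids and, wherever an image sentinel appears, splice in the next image's feature sequence. This illustrates the technique, not VILA's actual implementation; `IMAGE_TOKEN_INDEX = -200` and the function name are assumptions.

```python
import torch

IMAGE_TOKEN_INDEX = -200  # assumption: sentinel id LLaVA uses for <image>

def merge_multimodal(input_ids, image_features, embed_tokens):
    """Splice image patch embeddings into the text embedding sequence.

    input_ids: (seq_len,) LongTensor for one sample, with one
        IMAGE_TOKEN_INDEX sentinel per image.
    image_features: list of (num_patches, hidden) tensors, one per image,
        in the order the sentinels appear.
    embed_tokens: the model's token embedding module.
    """
    pieces, start = [], 0
    positions = (input_ids == IMAGE_TOKEN_INDEX).nonzero(as_tuple=True)[0]
    for img_idx, pos in enumerate(positions.tolist()):
        pieces.append(embed_tokens(input_ids[start:pos]))  # text before image
        pieces.append(image_features[img_idx])              # image patch embeds
        start = pos + 1
    pieces.append(embed_tokens(input_ids[start:]))          # trailing text
    return torch.cat(pieces, dim=0)
```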

Hello, I have tried your code and pretrained models; it's excellent work. But I've hit an issue with the multi-image task. My input single image width:height = 1...

Hi, I'm interested in your great work. The `./scripts/v1_5/eval/eval_all.sh` script is not available now. Could you release the evaluation tools? **Especially the few-shot VQA/Caption.** Release of the mmc4 pretrained weights would also be...