VILA
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
Hello authors, thanks for sharing this fantastic work. Could you tell me where this dataset came from, and share a link or the data? "/lustre/fsw/portfolios/nvr/projects/nvr_elm_llm/dataset/video_datasets_v2/perception_test/"
Your version of transformers forces LlamaFlashAttention2 in the constructor of LlamaDecoderLayer in transformers/models/llama/modeling_llama.py, which requires Ampere or newer to work. Just using the old LlamaAttention class instead of LlamaFlashAttention2...
I tried to install and run the project on a machine with an NVIDIA Tesla T4 GPU, which has a compute capability of 7.5 (SM 75). Environment Ubuntu 22.04 with...
Choose LlamaAttention instead of LlamaFlashAttention2 if flash attention is not supported by the GPU architecture. PR for #41
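For reference, a minimal sketch of the capability check such a fallback could use. The function name `pick_attn_implementation` is hypothetical; in real code the `(major, minor)` pair would come from `torch.cuda.get_device_capability()`:

```python
def pick_attn_implementation(major: int, minor: int) -> str:
    """Return an attention backend name for a CUDA compute capability.

    FlashAttention-2 kernels require Ampere (SM 80) or newer, so anything
    older (e.g. a Tesla T4 at SM 75) should fall back to the plain
    attention path. In practice the capability tuple comes from
    torch.cuda.get_device_capability().
    """
    if (major, minor) >= (8, 0):
        return "flash_attention_2"  # Ampere+: FlashAttention-2 supported
    return "eager"  # pre-Ampere: use the standard LlamaAttention path


# A Tesla T4 (SM 75) falls back to eager attention.
print(pick_attn_implementation(7, 5))
```

Newer transformers releases accept such a string directly via `AutoModelForCausalLM.from_pretrained(..., attn_implementation=...)`; with transformers 4.31, as pinned here, the equivalent fix is patching modeling_llama.py to construct LlamaAttention instead of LlamaFlashAttention2, as the PR above does.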
When running the code on Jetson, I found that the tokenizer (tokenizer = AutoTokenizer.from_pretrained(args.model_path, use_fast=False)) cannot correctly convert the LLAVA_DEFAULT_IMAGE_PATCH_TOKEN into its index in the vocabulary. That is, the...
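A toy stand-in for this failure mode (the vocab, ids, and the literal `<im_patch>` value are illustrative, not taken from the repo; the usual real-world fix is `tokenizer.add_tokens([token], special_tokens=True)` followed by `model.resize_token_embeddings(len(tokenizer))`):

```python
# Toy vocabulary standing in for a slow (use_fast=False) tokenizer's vocab.
# The special image-patch token is NOT in the base vocab, so the lookup
# falls back to the unknown-token id -- the symptom described above.
UNK_ID = 0
vocab = {"<unk>": UNK_ID, "hello": 1, "world": 2}

def convert_token_to_id(vocab: dict, token: str) -> int:
    """Mimic tokenizer.convert_tokens_to_ids for a single token."""
    return vocab.get(token, UNK_ID)

patch_token = "<im_patch>"  # illustrative placeholder for the value of
                            # LLAVA_DEFAULT_IMAGE_PATCH_TOKEN

print(convert_token_to_id(vocab, patch_token))  # -> 0 (UNK_ID)

# Registering the token (what tokenizer.add_tokens(...) does internally)
# appends it to the vocab with a fresh id, fixing the lookup.
vocab[patch_token] = len(vocab)
print(convert_token_to_id(vocab, patch_token))  # -> 3
```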
Hi, thanks for the nice work! I wonder what the main modifications in `llava/train/transformers_replace` are compared to the original implementation in `transformers==4.31.0`, as specified in the pyproject.toml. Also, in environment_setup.sh,...
Thank you for your fantastic work. Is there any plan to release the pre-trained model checkpoints before SFT for both VILA7B and VILA13B? It would be helpful to evaluate its...
I encountered confusion while reading the code for handling multi-image inputs, particularly in the following sections: https://github.com/Efficient-Large-Model/VILA/blob/ef662c84fe7e34101184ceab310fc41f837084b4/llava/model/llava_arch.py#L127 The nested for loops starting at https://github.com/Efficient-Large-Model/VILA/blob/ef662c84fe7e34101184ceab310fc41f837084b4/llava/model/llava_arch.py#L168 and https://github.com/Efficient-Large-Model/VILA/blob/ef662c84fe7e34101184ceab310fc41f837084b4/llava/model/llava_arch.py#L198 seem to iterate over...
Hello, I have tried your code and pretrained models; it's excellent work. But I ran into an issue with a multi-image task. My input single image has width:height = 1...
Hi, I'm interested in your great work. The `./scripts/v1_5/eval/eval_all.sh` script is not available now. Could you release the evaluation tools? **Especially the few-shot VQA/Caption.** A release of the mmc4 pretrained weights would also be appreciated...