VILA
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
I noticed that NVILA has three versions: Base, Lite, and Video. What are the differences between them, and how does NVILA-15B perform in video tasks, such as the test results...
I have seen from a previous issue that the model is able to reason across multiple images (see: https://github.com/NVlabs/VILA/issues/20). I wanted to try this with vila-infer as well; however, if I use...
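For context, the kind of command being attempted would look roughly like the sketch below. The flag names mirror the single-image `vila-infer` example in the repo's README; passing several paths to `--media` is an assumption to verify, which is exactly what this issue is asking about.

```bash
# Sketch of a multi-image vila-infer call.
# Flag names follow the README's single-image example; whether --media accepts
# multiple paths is an assumption, not documented behavior.
vila-infer \
    --model-path Efficient-Large-Model/NVILA-15B \
    --conv-mode auto \
    --text "What changed between the first and the second image?" \
    --media image_1.png image_2.png
```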
I am running inference with the Efficient-Large-Model/VILA1.5-13b model. When using the Efficient-Large-Model/VILA1.5-3b and Efficient-Large-Model/Llama-3-VILA1.5-8B models, the results are generated correctly without any issues. However, when running inference with the 13B...
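For reference, a single-image invocation of the sort being compared across the three checkpoints might look like the following sketch. The flag names follow the `llava/eval/run_vila.py` example from the repo, and using `vicuna_v1` as the conv-mode for the 13B checkpoint is an assumption based on its Vicuna backbone.

```bash
# Sketch: same command used for the 3B/8B checkpoints, switched to 13B.
# conv-mode vicuna_v1 is an assumption; verify against the repo's examples.
python -W ignore llava/eval/run_vila.py \
    --model-path Efficient-Large-Model/VILA1.5-13b \
    --conv-mode vicuna_v1 \
    --query "<image>\n Please describe the image." \
    --image-file demo.png
```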
[2024-12-18 17:36:31,349] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO:     Started server process [3865832]
INFO:     Waiting for application startup.
Loading checkpoint shards: 100%|██████████| 2/2 [00:01
Hi, I wonder what the conv_mode is for VILA1.5-40b in video inference? Additionally, I noticed that the \ token seems to be invalid in video inference. The eval code will automatically add...
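For reference, a video-inference call along these lines is what the question concerns. The flag names follow the repo's video example, and `hermes-2` as the conv-mode is an assumption based on the 40B model's Hermes-2/Yi backbone rather than a confirmed answer.

```bash
# Sketch of video inference with VILA1.5-40b.
# conv-mode hermes-2 is an assumption (Hermes-2/Yi backbone); check the repo.
python -W ignore llava/eval/run_vila.py \
    --model-path Efficient-Large-Model/VILA1.5-40b \
    --conv-mode hermes-2 \
    --query "<video>\n Please describe this video." \
    --video-file demo.mp4
```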
For quantizing the LLM part of VILA, I would like to know why AWQ was chosen instead of GPTQ. Have you tried using GPTQ to quantize the LLM part? AWQ...

The argument order differs in LLaVA's function, so I updated it so that the arguments can be passed in either order.
I've encountered a persistent issue while running the "Gradio demo: VILA with TinyChat" on a local server, despite following the steps here: [GitHub Link](https://github.com/mit-han-lab/llm-awq/tree/main/tinychat/serve). **Problem:** The model fails...
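For context, the serving setup in that README is a controller / model-worker / Gradio front-end pipeline. The sketch below shows the general shape of the three processes; the module names, ports, and flags here are assumptions recalled from the linked TinyChat serve instructions and should be checked against that page rather than taken as the exact commands.

```bash
# Rough sketch of the TinyChat serving pipeline (run each process in its own terminal).
# Module names, ports, and flags are assumptions; follow the linked README for exact commands.
python -m tinychat.serve.controller --host 0.0.0.0 --port 10000

python -m tinychat.serve.model_worker_new \
    --host 0.0.0.0 --port 40000 \
    --controller http://localhost:10000 \
    --worker http://localhost:40000 \
    --model-path /path/to/VILA1.5-13b-AWQ \
    --quant-path /path/to/awq-weights.pt

python -m tinychat.serve.gradio_web_server \
    --controller http://localhost:10000 \
    --model-list-mode reload
```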
I want to start fine-tuning on my own dataset from stage 2 of VILA1.5-3b. I noticed in `3_sft.sh` that there is a comment for the output of the stage...
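For reference, launching SFT from a stage-2 checkpoint would look roughly like the sketch below. The positional arguments (stage-2 checkpoint path, then output directory) are an assumption based on the pattern of the other stage scripts and should be checked against the header of `3_sft.sh` itself.

```bash
# Sketch: launch SFT (stage 3) from a stage-2 checkpoint of VILA1.5-3b.
# Argument order is an assumption; check 3_sft.sh for the inputs it expects.
STAGE2_CKPT=/path/to/vila1.5-3b-stage2        # hypothetical local stage-2 checkpoint
OUTPUT_DIR=./checkpoints/vila1.5-3b-my-sft    # where the fine-tuned model will be written

# Run from the directory containing 3_sft.sh (the path within the repo may differ).
bash 3_sft.sh "$STAGE2_CKPT" "$OUTPUT_DIR"
```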