VILA
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
LongViLa-LLama3-1024Frames output is often repetitive. Why does this happen, and are there any suggestions to reduce the repetition?
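A common way to reduce repetitive generations is to adjust the decoding settings, since long video inputs under greedy decoding tend to amplify repetition. The sketch below uses Hugging Face `GenerationConfig` field names as a reference point; whether the LongVILA inference path exposes these exact knobs is an assumption, so map them onto whatever generation arguments the script actually accepts.

```python
from transformers import GenerationConfig

# Decoding settings that commonly curb repetition (an assumption-level sketch,
# not LongVILA's defaults); pass the equivalents to whatever generate() call
# the inference script uses.
gen_cfg = GenerationConfig(
    do_sample=True,          # sample instead of greedy decoding
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,  # down-weight tokens that already appeared
    no_repeat_ngram_size=3,  # forbid repeating any 3-gram verbatim
    max_new_tokens=512,
)
print(gen_cfg)
```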
How can I modify this to be used for video querying and description?

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000",
    api_key="fake-key",
)
response = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            ...
```
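For video querying specifically, a minimal sketch is shown below. It assumes the local VILA server accepts a `video_url` content entry analogous to OpenAI's `image_url` type; check `serving/server.py` or `query_nvila.py` in the repo for the exact schema it parses. The model name and media URL are placeholders.

```python
from openai import OpenAI

# Minimal sketch, not the official client: it assumes the local VILA server
# accepts a "video_url" content entry analogous to OpenAI's "image_url" type.
# Check serving/server.py (or query_nvila.py) for the exact schema it parses;
# the model name and URL below are placeholders.
client = OpenAI(base_url="http://localhost:8000", api_key="fake-key")

response = client.chat.completions.create(
    model="NVILA-8B-Video",  # placeholder; use whatever the server was launched with
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please describe the video."},
                {
                    "type": "video_url",  # assumption: server-specific content type
                    "video_url": {"url": "https://example.com/clip.mp4"},
                },
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```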
I ran this script:

```bash
vila-infer \
  --model-path /NVILA-8B-Video \
  --conv-mode auto \
  --text "Please describe the video" \
  --media https://huggingface.co/datasets/Efficient-Large-Model/VILA-inference-demos/resolve/main/OAI-sora-tokyo-walk.mp4
```

but got this error message:

```
input = media_embeds[name].popleft()
IndexError: pop...
```
Hi, I am using NVILA-Lite-8B-stage2 to fine-tune on my downstream task. The input has at most 8 images and at least 3. But I found that 7×A100 with ZeRO-2 can't...
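When ZeRO-2 runs out of memory on samples with many images, a common fallback is ZeRO-3 with CPU offload. The config below is a generic DeepSpeed sketch, not the repo's own zero2/zero3 JSON; verify the keys against the DeepSpeed documentation and point the training launcher's DeepSpeed argument at the generated file.

```python
import json

# Generic DeepSpeed ZeRO-3 config with CPU offload, written out as JSON.
# This is an assumption-level sketch, not VILA's shipped configuration; pass
# the resulting file via the launcher's DeepSpeed config argument.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

with open("zero3_offload.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```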
I am very confused about the models here. https://huggingface.co/collections/Efficient-Large-Model/nvila-674f8163543890b35a91b428
Hello, I am running `serving/server.py` with NVILA-Lite-8B and using the OpenAI API to retrieve chat completions, as done in [query_nvila.py](https://github.com/NVlabs/VILA/blob/main/serving/query_nvila.py). Now I want to enforce structured output, but I get: `Error...
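For reference, the standard OpenAI-style way to request structured output is sketched below; whether `serving/server.py` implements `response_format` at all is exactly what seems to be failing here, so treat that parameter as an assumption about the server. The base URL and model name are placeholders.

```python
from openai import OpenAI

# Sketch of the standard OpenAI-style structured-output request, for reference.
# Whether the VILA serving/server.py endpoint supports response_format is an
# assumption (the error above suggests it may not); base_url is a placeholder.
client = OpenAI(base_url="http://localhost:8000", api_key="fake-key")

response = client.chat.completions.create(
    model="NVILA-Lite-8B",  # placeholder name
    messages=[
        {"role": "system", "content": "Reply only with a JSON object."},
        {"role": "user", "content": 'Describe the image as {"caption": ..., "objects": [...]}.'},
    ],
    response_format={"type": "json_object"},  # OpenAI JSON mode; server support is the question
)
print(response.choices[0].message.content)
```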
Hi there, I am trying to fine-tune the VILA1.5-3b model with a custom labeled dataset. I am using a well-resourced cluster with 2 A100 GPUs and 100 GB of RAM on...
Hello everyone, thanks for this amazing work! I'm trying to run inference with the NVILA-8B model on an NVIDIA V100 GPU but am facing an issue. I understand from the model requirements that NVILA...
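One likely culprit on V100 is the data type: bfloat16 needs compute capability 8.0 (Ampere) or newer, while V100 is 7.0, so a bf16-default checkpoint would have to be run in fp16 or fp32 there. A quick check, assuming PyTorch as the backend (whether VILA's scripts expose a dtype switch is an assumption):

```python
import torch

# Report the GPU's compute capability and whether bfloat16 is supported.
# V100 (sm_70) predates hardware bf16, which arrived with Ampere (sm_80).
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")
print("bf16 supported:", torch.cuda.is_bf16_supported())

# Fall back to fp16 where bf16 is unavailable (an assumption-level workaround;
# numerics may differ from the bf16 the model was trained and evaluated with).
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
print("suggested dtype:", dtype)
```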
Hello everyone, thanks for sharing this work. I am trying to benchmark it using a different dataset/task. For now, I am more concerned about the latency numbers. I am...
I am trying to start the NVILA server for the 15B model, but it has lots of bugs, and the latest one is that it is not able to take text and an image together. I see...
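For what it's worth, the OpenAI-style way to combine text and an image in a single request is sketched below, this time with a local file sent as a base64 data URL. It assumes the server parses `image_url` content entries the way `query_nvila.py` sends them; the model name and file path are placeholders.

```python
import base64
from openai import OpenAI

# Combined text + image request in the OpenAI vision-message format.
# Assumes serving/server.py accepts "image_url" content entries; the model
# name and image path below are placeholders.
client = OpenAI(base_url="http://localhost:8000", api_key="fake-key")

with open("demo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="NVILA-15B",  # placeholder; use the model the server was launched with
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```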