
Context size and examples for LongVILA

Open yulinzou opened this issue 1 year ago • 1 comment

Hello,

I'm new to LLM serving and multi-modal LLMs. I'm looking for an example for the LongVILA model similar to the one provided for the VILA1.5 models:

python -W ignore llava/eval/run_vila.py \
    --model-path Efficient-Large-Model/VILA1.5-13b \
    --conv-mode vicuna_v1 \
    --query "<video>\n Describe what happened in the video." \
    --video-file "./example.mp4" \
    --num-video-frames 20

Specifically, I'd like to know which conv-mode I should use and the maximum number of frames supported, for both the LongVILA and VILA1.5 models. I also noticed that the paper mentions a downsampler that can reduce the number of tokens per image; do you have an example of how to use it?

Thanks!

yulinzou avatar Sep 30 '24 08:09 yulinzou

@yukang2017 can you help confirm the context length? I think the conv mode should be llama3.
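
For reference, a minimal sketch of how the VILA1.5 command above might be adapted for LongVILA, assuming run_vila.py accepts the same flags. The checkpoint name (Llama-3-LongVILA-8B-1024Frames), the exact conv-mode string (llama3 vs. llama_3), and the frame count here are assumptions; verify them against the repo's model zoo and llava/conversation.py:

# Hypothetical LongVILA invocation; model path, conv-mode string, and
# frame count are assumptions to check against the repo, not confirmed values.
python -W ignore llava/eval/run_vila.py \
    --model-path Efficient-Large-Model/Llama-3-LongVILA-8B-1024Frames \
    --conv-mode llama_3 \
    --query "<video>\n Describe what happened in the video." \
    --video-file "./example.mp4" \
    --num-video-frames 256

If the checkpoint name is accurate, it was trained with up to 1024 frames, so --num-video-frames could in principle go that high; in practice GPU memory is likely the binding limit.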

Lyken17 avatar Nov 19 '24 14:11 Lyken17