
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.

141 VILA issues

Hello authors, thanks for sharing this fantastic work. Could you say where this dataset came from, and share a link or the data itself? "/lustre/fsw/portfolios/nvr/projects/nvr_elm_llm/dataset/video_datasets_v2/perception_test/"

Your version of transformers forces LlamaFlashAttention2 in the constructor of LlamaDecoderLayer in transformers/models/llama/modeling_llama.py, which requires Ampere or newer to work. Simply using the old LlamaAttention class instead of LlamaFlashAttention2...

I tried to install and run the project on a machine with an NVIDIA Tesla T4 GPU, which has a compute capability of 7.5 (SM 75). Environment: Ubuntu 22.04 with...

Choose LlamaAttention instead of LlamaFlashAttention2 if flash attention is not supported by the GPU architecture. PR for #41
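
Since this comes up repeatedly on pre-Ampere GPUs (T4, V100, and similar), here is a minimal sketch of the fallback this PR describes. It assumes the patched modeling_llama.py exposes both attention classes and that the patch runs before any model is constructed; the SM 80 threshold is FlashAttention-2's documented hardware requirement.

```python
# Sketch of the workaround: on pre-Ampere GPUs (compute capability < 8.0),
# alias LlamaFlashAttention2 to the standard LlamaAttention so the patched
# LlamaDecoderLayer constructor falls back to eager attention.
# Apply this before any model is built.
import torch
import transformers.models.llama.modeling_llama as modeling_llama

major, _minor = torch.cuda.get_device_capability()
if major < 8:  # FlashAttention-2 requires Ampere (SM 80) or newer
    modeling_llama.LlamaFlashAttention2 = modeling_llama.LlamaAttention
```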

When running the code on Jetson, I found that the tokenizer (tokenizer = AutoTokenizer.from_pretrained(args.model_path, use_fast=False)) cannot correctly convert the LLAVA_DEFAULT_IMAGE_PATCH_TOKEN into its index in the vocabulary. That is, the...
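
For anyone hitting the same symptom, a hedged sketch of checking whether the patch token is registered, and registering it if missing, is below. The literal "<im_patch>" is an assumption borrowed from LLaVA's DEFAULT_IMAGE_PATCH_TOKEN, and the model path is a placeholder; use the constants and path your checkpoint actually defines.

```python
# Check whether the image patch token maps to a real vocab id, and register
# it as a special token if it falls back to <unk>.
from transformers import AutoTokenizer

model_path = "Efficient-Large-Model/VILA-7B"  # placeholder; use your path
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

patch_token = "<im_patch>"  # assumption: LLaVA's DEFAULT_IMAGE_PATCH_TOKEN
token_id = tokenizer.convert_tokens_to_ids(patch_token)
if token_id is None or token_id == tokenizer.unk_token_id:
    # Not in the vocabulary; register it so it gets a dedicated id.
    # (The model's token embeddings must then be resized to match.)
    tokenizer.add_tokens([patch_token], special_tokens=True)
print(patch_token, "->", tokenizer.convert_tokens_to_ids(patch_token))
```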

Hi, thanks for the nice work! I wonder what the main modifications in `llava/train/transformers_replace` are compared to the original implementation in `transformers==4.31.0`, as specified in the pyproject.toml. Also, in environment_setup.sh,...
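
One way to answer this locally is to diff the replacement files against the installed package. A small sketch, assuming the repo layout above, that `transformers_replace` mirrors the package's directory structure, and that transformers 4.31.0 is installed in the current environment:

```python
# Diff each file in llava/train/transformers_replace against its counterpart
# in the installed transformers package to see exactly what was modified.
import difflib
import pathlib

import transformers

replace_dir = pathlib.Path("llava/train/transformers_replace")  # repo checkout
pkg_dir = pathlib.Path(transformers.__file__).parent

for patched in replace_dir.rglob("*.py"):
    original = pkg_dir / patched.relative_to(replace_dir)
    if not original.exists():
        continue
    diff = difflib.unified_diff(
        original.read_text().splitlines(),
        patched.read_text().splitlines(),
        fromfile=str(original),
        tofile=str(patched),
        lineterm="",
    )
    for line in diff:
        print(line)
```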

Thank you for your fantastic work. Is there any plan to release the pre-trained model checkpoints before SFT for both VILA7B and VILA13B? It would be helpful to evaluate its...

I found the code for handling multi-image inputs confusing, particularly in the following sections: https://github.com/Efficient-Large-Model/VILA/blob/ef662c84fe7e34101184ceab310fc41f837084b4/llava/model/llava_arch.py#L127 The nested for loops starting at https://github.com/Efficient-Large-Model/VILA/blob/ef662c84fe7e34101184ceab310fc41f837084b4/llava/model/llava_arch.py#L168 and https://github.com/Efficient-Large-Model/VILA/blob/ef662c84fe7e34101184ceab310fc41f837084b4/llava/model/llava_arch.py#L198 seem to iterate over...
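
For readers trying to follow those loops, here is a minimal single-sample sketch of the general LLaVA-style merge that code performs: walk the token ids and, wherever an image sentinel appears, splice in the next image's feature sequence. This illustrates the technique, not VILA's actual implementation; `IMAGE_TOKEN_INDEX = -200` and the function name are assumptions.

```python
import torch

IMAGE_TOKEN_INDEX = -200  # assumption: sentinel id LLaVA uses for <image>

def merge_multimodal(input_ids, image_features, embed_tokens):
    """Splice image patch embeddings into the text embedding sequence.

    input_ids: (seq_len,) LongTensor for one sample, with one
        IMAGE_TOKEN_INDEX sentinel per image.
    image_features: list of (num_patches, hidden) tensors, one per image,
        in the order the sentinels appear.
    embed_tokens: the model's token embedding module.
    """
    pieces, start = [], 0
    positions = (input_ids == IMAGE_TOKEN_INDEX).nonzero(as_tuple=True)[0]
    for img_idx, pos in enumerate(positions.tolist()):
        pieces.append(embed_tokens(input_ids[start:pos]))  # text before image
        pieces.append(image_features[img_idx])              # image patch embeds
        start = pos + 1
    pieces.append(embed_tokens(input_ids[start:]))          # trailing text
    return torch.cat(pieces, dim=0)
```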

Hello, I have tried your code and pretrained models; it's excellent work. But I've hit an issue with the multi-image task. My input single image width:height = 1...

Hi, I'm interested in your great work. The `./scripts/v1_5/eval/eval_all.sh` script is not available now. Could you release the evaluation tools? **Especially the few-shot VQA/Caption.** Release of the mmc4 pretrained weights would also be...