
Results 344 LLaVA-NeXT issues

Judging from the paper, the vision tower should have been tuned, yet the config.json of lmms-lab/llava-onevision-qwen2-7b-ov contains `"mm_vision_tower": "google/siglip-so400m-patch14-384"`, which looks like the original vision tower is being loaded. So I have a question: is the original vision tower loaded first and its parameters then overwritten? During the overwrite there is a warning: ``` envs/llavaov/lib/python3.10/site-packages/torch/nn/modules/module.py:2025: UserWarning: for vision_model.encoder.layers.21.self_attn.k_proj.weight: copying from a non-meta parameter in the checkpoint to a meta parameter in the current model, which is a...

I noticed that when I process a video frame with a standard 16:9 aspect ratio, the processed output frame isn't zero-padded, and the aspect ratio is distorted. Is this intended?...
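For reference, letterbox-style zero padding that preserves the aspect ratio can be sketched as follows (`pad_to_square` is an illustrative helper, not the preprocessing the repo actually applies):

```python
import numpy as np

def pad_to_square(frame: np.ndarray, fill: int = 0) -> np.ndarray:
    """Zero-pad an HxWxC frame to a square canvas, centering the content."""
    h, w = frame.shape[:2]
    side = max(h, w)
    top = (side - h) // 2
    left = (side - w) // 2
    out = np.full((side, side) + frame.shape[2:], fill, dtype=frame.dtype)
    out[top:top + h, left:left + w] = frame
    return out

frame = np.ones((720, 1280, 3), dtype=np.uint8)  # a 16:9 video frame
sq = pad_to_square(frame)
print(sq.shape)  # (1280, 1280, 3): padded, not stretched
```

If the processor instead resizes directly to a square target, the 16:9 content is stretched vertically, which would match the distortion described above.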

LLaVA-OneVision used this `LLaVA-Wild (train)` dataset, but it is not provided in https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data. Furthermore, the paper refers to the original LLaVA paper [83] (see above figure), but this dataset does not match the...

Thanks for your great work. I'm downloading the LLaVA-NeXT instruction tuning data through [lmms-lab/LLaVA-NeXT-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data). However, I found that there are around 779k samples in the [parquet directory](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data/tree/main/data) and only 738k samples in...

Hi there! 😊 First of all, thank you so much for your amazing work on LLaVA-NeXT! I was reading about the performance improvements and how it maintains the minimalist design...

I'm running finetune_onevision.sh to finetune on my dataset and I get this error: ``` Traceback (most recent call last): File "/home/ubuntu/LLaVA-NeXT/llava/train/train_mem.py", line 4, in train() File "/home/ubuntu/LLaVA-NeXT/llava/train/train.py", line 1672, in train...

Dear authors, thanks for your remarkable work! I'd like to evaluate the LLaVA-OV model on different datasets, so I looked at the three bash files (eval_all.sh, eval_interleave_3d.sh and eval_multiprocess.sh) in scripts/interleave...

Does anyone know why the shape of outputs.attentions[0][-1] is [1, 754, 28, 28]? 754 is the total number of tokens in the input plus the current outputs, but I wonder what the 28, 28...

Since my server environment does not seem to support Ampere GPUs, I have been trying to disable FlashAttention. First, I simply copied the [train_xformers.py](https://github.com/haotian-liu/LLaVA/blob/main/llava/train/train_xformers.py) and [llama_xformers_attn_monkey_patch.py](https://github.com/haotian-liu/LLaVA/blob/main/llava/train/llama_xformers_attn_monkey_patch.py) files to my...