LLaVA-NeXT

Error occurs when single and multi-image inputs are included in the same batch

Open seokwon99 opened this issue 4 months ago • 1 comment

When passing both single-image and multi-image inputs in the same batch, the following error occurs:

RuntimeError: Tensors must have same number of dimensions: got 2 and 1

Is there any solution?

Reproduction code

import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

# Load the model in half-precision
model = LlavaOnevisionForConditionalGeneration.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf", torch_dtype=torch.float16, device_map="auto")
processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf")

img_path = "{your_image}"
messages = [
    [
        {
            "role": "user",
            "content": [ # QA 1
                {"type": "image_url", "image_url": {"url": img_path}},
                {"type": "text", "text": "Text."},
            ]
        }
    ],
    [
        {
            "role": "user",
            "content": [ # QA 2
                {"type": "image_url", "image_url": {"url": img_path}},
                {"type": "image_url", "image_url": {"url": img_path}},
                {"type": "text", "text": "Text."},
            ]
        }
    ],
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    padding=True,
    return_tensors="pt"
).to(model.device, torch.float16)

# Generate
generate_ids = model.generate(**inputs, max_new_tokens=30)
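
One possible workaround while this is open, not confirmed by the maintainers, is to avoid mixing image counts inside a single batch: split the conversations into groups that each contain the same number of images, then run every group through the processor and `generate` separately. The sketch below covers only the grouping step. `count_images` and `group_by_image_count` are hypothetical helper names rather than library functions, and the assumption that same-count batches avoid the error is untested here.

```python
from collections import defaultdict


def count_images(conversation):
    """Count image entries across all turns of one conversation."""
    total = 0
    for turn in conversation:
        content = turn["content"]
        if isinstance(content, str):
            # Plain-text turns carry no images.
            continue
        total += sum(1 for part in content if part.get("type") in ("image", "image_url"))
    return total


def group_by_image_count(conversations):
    """Map image count -> list of (original_index, conversation) pairs."""
    groups = defaultdict(list)
    for index, conversation in enumerate(conversations):
        groups[count_images(conversation)].append((index, conversation))
    return dict(groups)


# Toy stand-in for the `messages` list in the reproduction code.
example = [
    [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "a.png"}},
        {"type": "text", "text": "Text."},
    ]}],
    [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "a.png"}},
        {"type": "image_url", "image_url": {"url": "b.png"}},
        {"type": "text", "text": "Text."},
    ]}],
]

groups = group_by_image_count(example)
for image_count, items in sorted(groups.items()):
    print(image_count, [index for index, _ in items])
```

Each group's conversations could then be passed to `processor.apply_chat_template` and `model.generate` on their own, and the saved indices used to put the outputs back in the original order. This gives up some batching efficiency in exchange for homogeneous batches.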

seokwon99, Jul 17 '25 19:07