
Some questions about the test results of LLaVA-OneVision (multi-image input) and LLaVA-Video (video input)

zhousheng97 opened this issue 6 months ago · 0 comments

Hello, when I tested LLaVA-OneVision and LLaVA-Video, I found that the results of LLaVA-OneVision were unexpectedly poor. Is there anything I did not set correctly?

The prompt of LLaVA-OneVision is:

image_prompt = f"{DEFAULT_IMAGE_TOKEN} {DEFAULT_IMAGE_TOKEN} {DEFAULT_IMAGE_TOKEN} These are three consecutive video frames."
question_prompt = "Please answer the following questions related to the video. If you cannot answer the question, please answer 'Unanswerable' and briefly explain why you cannot answer. Keep your answer as short as possible."
prompt = f"{image_prompt}\n" + f"{question_prompt}\n"

The prompt of LLaVA-Video is:

time_instruction = "Please answer the following questions related to this video. If you cannot answer the question, please answer 'Unanswerable' and briefly explain why you cannot answer. Keep your answer as short as possible."
question = DEFAULT_IMAGE_TOKEN + f"{time_instruction}\n" + question
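
The rest of the LLaVA-Video pipeline roughly follows the official demo: frames are uniformly sampled from the clip, preprocessed into a single video tensor, and passed to generate with modalities=["video"]. A sketch of that path (load_video is the demo's decord-based uniform frame sampler; video_path and the 64-frame budget are placeholders here, not values from my run):

# Sketch of the LLaVA-Video path, roughly following the official demo.
video_frames, frame_time, video_time = load_video(video_path, max_frames_num=64, fps=1, force_sample=True)
video = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"].to(device, dtype=torch.float16)

conv = copy.deepcopy(conv_templates["qwen_1_5"])
conv.append_message(conv.roles[0], question)   # `question` already contains the time instruction above
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
cont = model.generate(
    input_ids,
    images=[video],          # one tensor holding all sampled frames
    modalities=["video"],    # routes the frames through the pooled video path (196 tokens per frame)
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
answer = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()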

The full inference code for LLaVA-OneVision is:

import copy

import torch
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import process_images, tokenizer_image_token

# model, tokenizer, image_processor and device are loaded beforehand (e.g. via load_pretrained_model)
def predict(frames_path, question):

    # Step 1: Load video frames from local paths
    images = [Image.open(path) for path in frames_path]

    # Preprocess the frames and move them to the model's device/dtype
    image_tensors = process_images(images, image_processor, model.config)
    image_tensors = [_image.to(dtype=torch.float16, device=device) for _image in image_tensors]

    # Step 2: Prepare the question prompt
    conv_template = "qwen_1_5"  
    image_prompt = f"{DEFAULT_IMAGE_TOKEN} {DEFAULT_IMAGE_TOKEN} {DEFAULT_IMAGE_TOKEN} These are three consecutive video frames."
    question_prompt = "Please answer the following questions related to the video. If you cannot answer the question, please answer 'Unanswerable' and briefly explain why you cannot answer. Keep your answer as short as possible."
    prompt = f"{image_prompt}\n" + f"{question_prompt}\n"
    question = prompt + question

    conv = copy.deepcopy(conv_templates[conv_template])
    conv.append_message(conv.roles[0], question)
    conv.append_message(conv.roles[1], None)
    prompt_question = conv.get_prompt()
    print(f"prompt_question: {prompt_question}")

    input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
    image_sizes = [image.size for image in images]

    # Step 3: Inference with the model
    with torch.no_grad():  # disable gradient tracking during inference to save memory
        cont = model.generate(
            input_ids,
            images=image_tensors,
            image_sizes=image_sizes,
            do_sample=False,
            temperature=0,
            max_new_tokens=4096,
        )

    # Decode the output and clean up
    text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
    
    return text_outputs, prompt
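
I call the function like this (the frame paths below are placeholders for the three consecutive frames):

frames_path = ["frame_000.jpg", "frame_001.jpg", "frame_002.jpg"]  # placeholder paths
answer, used_prompt = predict(frames_path, "what is the game we're about to play?")
print(answer)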

Here are two test examples:

Case 1:

The input question is: "what is the game we're about to play?" The correct answer is: "sorry!"

LLaVA-OneVision input is the following three consecutive frames [three images attached in the original issue].

LLaVA-Video input is the entire corresponding video (a 3-minute clip from Ego4D).

Finally, the outputs of the two models are: "llava-video": "sorry", "llava-onevision": "solitaire"

Case 2:

The input question is: "which building am i approaching?" The correct answer is: "united air lines"

LLaVA-OneVision input is the following three consecutive frames [three images attached in the original issue].

LLaVA-Video input is the entire corresponding video (a 3-minute clip from Ego4D).

Finally, the outputs of the two models are: "llava-video": "united air lines", "llava-onevision": "unanswerable"

Logically, with multi-image input LLaVA-OneVision processes more tokens per frame (729) than the video path does (196 per frame), and videos typically contain more redundant information, so I expected the multi-image setup to do at least as well. However, the results of these two examples don't align with that expectation. Could this be due to an issue with how I designed the prompt? If so, I'd really appreciate it if anyone could point out the problem.
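
For concreteness, a rough visual-token budget using these per-frame figures (the video-side frame count is an assumption and depends on the sampling setting):

onevision_total = 3 * 729    # 3 frames as multi-image input -> 2187 visual tokens
video_total = 64 * 196       # assuming ~64 uniformly sampled frames -> 12544 visual tokens
# Per frame, multi-image input uses more tokens (729 vs 196), but the video path sees many more frames in total.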

zhousheng97 · May 12 '25 14:05