LLaVA-NeXT
Some questions about the test results of LLaVA-OneVision (multi-image input) and LLaVA-Video (video input)
Hello, when I tested LLaVA-OneVision and LLaVA-Video, I found that the results of LLaVA-OneVision were unexpectedly poor. Is there anything I did not set up correctly?
The prompt for LLaVA-OneVision is:
image_prompt = f"{DEFAULT_IMAGE_TOKEN} {DEFAULT_IMAGE_TOKEN} {DEFAULT_IMAGE_TOKEN} These are three consecutive video frames."
question_prompt = "Please answer the following questions related to the video. If you cannot answer the question, please answer 'Unanswerable' and briefly explain why you cannot answer. Keep your answer as short as possible."
prompt = f"{image_prompt}\n" + f"{question_prompt}\n"
The prompt for LLaVA-Video is:
time_instruction = "Please answer the following questions related to this video. If you cannot answer the question, please answer 'Unanswerable' and briefly explain why you cannot answer. Keep your answer as short as possible."
question = DEFAULT_IMAGE_TOKEN + f"{time_instruction}\n" + question
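On the LLaVA-Video side, frames are sampled uniformly over the whole clip before this prompt is built; a rough sketch of that step (the 64-frame count and the use of decord are my own choices here, not something fixed by the repo):

    import numpy as np
    from decord import VideoReader, cpu

    def load_video_frames(video_path, num_frames=64):
        # Uniformly sample `num_frames` frames across the entire clip.
        vr = VideoReader(video_path, ctx=cpu(0))
        frame_idx = np.linspace(0, len(vr) - 1, num_frames, dtype=int).tolist()
        frames = vr.get_batch(frame_idx).asnumpy()  # (num_frames, H, W, 3)
        video_time = len(vr) / vr.get_avg_fps()     # clip length in seconds
        return frames, video_time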
The full inference code for LLaVA-OneVision is:
import copy

import torch
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import process_images, tokenizer_image_token

# `model`, `tokenizer`, `image_processor`, and `device` are loaded beforehand
# (e.g. with load_pretrained_model from llava.model.builder).

def predict(frames_path, question):
    # Step 1: Load the video frames from their local paths and preprocess them
    images = [Image.open(path) for path in frames_path]
    image_tensors = process_images(images, image_processor, model.config)
    image_tensors = [_image.to(dtype=torch.float16, device=device) for _image in image_tensors]

    # Step 2: Build the question prompt and wrap it in the qwen_1_5 conversation template
    conv_template = "qwen_1_5"
    image_prompt = f"{DEFAULT_IMAGE_TOKEN} {DEFAULT_IMAGE_TOKEN} {DEFAULT_IMAGE_TOKEN} These are three consecutive video frames."
    question_prompt = "Please answer the following questions related to the video. If you cannot answer the question, please answer 'Unanswerable' and briefly explain why you cannot answer. Keep your answer as short as possible."
    prompt = f"{image_prompt}\n" + f"{question_prompt}\n"
    question = prompt + question

    conv = copy.deepcopy(conv_templates[conv_template])
    conv.append_message(conv.roles[0], question)
    conv.append_message(conv.roles[1], None)
    prompt_question = conv.get_prompt()
    print(f"prompt_question: {prompt_question}")

    input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
    image_sizes = [image.size for image in images]

    # Step 3: Run inference
    # with torch.no_grad():  # Disable gradients for inference to save memory
    cont = model.generate(
        input_ids,
        images=image_tensors,
        image_sizes=image_sizes,
        do_sample=False,
        temperature=0,
        max_new_tokens=4096,
    )

    # Decode the output and clean it up
    text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
    return text_outputs, prompt
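The function is called once per question, e.g. (the frame paths below are placeholders):

    frames_path = ["frame_000.jpg", "frame_001.jpg", "frame_002.jpg"]
    answer, used_prompt = predict(frames_path, "what is the game we're about to play?")
    print(answer)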
Here are two test examples:
Case 1:
The input question is: "what is the game we're about to play?" The correct answer is: "sorry!"
The input to LLaVA-OneVision is the following three images:
The input to LLaVA-Video is the corresponding full video (a 3-minute clip from Ego4D).
Finally, the outputs of the two models are: "llava-video": "sorry", "llava-onevision": "solitaire"
Case 2:
The input question is: "which building am i approaching?" The correct answer is: "united air lines"
The input to LLaVA-OneVision is the following three images:
The input to LLaVA-Video is the corresponding full video (a 3-minute clip from Ego4D).
Finally, the outputs of the two models are: "llava-video": "united air lines", "llava-onevision": "unanswerable"
Logically, LLaVA-OneVision should be at an advantage here: with multi-image input it processes more visual tokens per frame (729) than the video pathway does (196), and a full video typically contains more redundant information. However, the results of these two examples don't match what I initially expected. Could this be due to an issue with how I designed the prompt? If so, I'd really appreciate it if anyone could point out the problem.
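For reference, a back-of-the-envelope comparison of the visual-token budgets using the per-frame counts above (the 64-frame count for the video branch is only an assumption, since it depends on how many frames are sampled):

    # Per-frame token counts from above; the video frame count is an assumption.
    onevision_frames, onevision_tokens_per_frame = 3, 729
    video_frames, video_tokens_per_frame = 64, 196

    print(onevision_frames * onevision_tokens_per_frame)  # 3 * 729  = 2187 visual tokens
    print(video_frames * video_tokens_per_frame)          # 64 * 196 = 12544 visual tokens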