
[Bug] Potential bug in InternVL3 preprocessor causing one video frame to be dropped for multi-video inputs

Open · MercurialBlade opened this issue 3 months ago • 1 comment

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

When using the InternVL3 preprocessor with multiple videos (e.g., two videos), there appears to be an issue in processing_internvl.py starting at line 131. The second video's frame count seems to end up off by one, which leads to the following error downstream:

ValueError: Image features and image tokens do not match: tokens: 4096, features 3840
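The reported numbers are consistent with exactly one dropped frame. A back-of-the-envelope check (assuming 256 image tokens per video frame, which is what the counts in the error imply):

tokens_per_frame = 256
expected_frames = 2 * 8                    # two videos, num_frames=8 each
print(expected_frames * tokens_per_frame)  # 4096 -> matches "tokens: 4096"
print(3840 / tokens_per_frame)             # 15.0 -> one frame short of 16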

Reproduction

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch
import os

os.environ["HTTPS_PROXY"] = "http://127.0.0.1:7890"
os.environ["HTTP_PROXY"] = "http://127.0.0.1:7890"

torch_device = "cuda"
model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
processor = AutoProcessor.from_pretrained(model_checkpoint)
model = AutoModelForImageTextToText.from_pretrained(
    model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

system_prompt = """
You are a helpful multimodal assistant tasked with evaluating the quality of videos generated for a given input caption. Your goal is to determine which video better matches the caption. Choose either Video A or Video B as the better option. Do NOT say both / neither are good.
Here are some rules for the evaluation:
(1) If the caption does not contain harmful content, prioritize evaluating whether the video faithfully and precisely reflects the caption, then consider its helpfulness, accuracy, level of detail, harmlessness, realism, motion consistency, and overall visual quality.
(2) If the caption contains harmful content, prioritize the harmlessness and safety of the video.
(3) The video should NOT include elements that are irrelevant to or missing from the caption, as such outputs do NOT precisely execute the instruction.
(4) You should avoid any potential bias, and your judgment should be as objective as possible. Here are some potential sources of bias:

  • The order in which the videos are presented should NOT affect your judgment, as Video A and Video B are equally likely to be better.
  • The rendering style (e.g., realistic, cartoonish, cinematic) should NOT affect your judgment unless explicitly specified in the caption.
  • Do not assume that a more visually complex video is necessarily better; evaluate whether the complexity and motion quality are appropriate for the given caption.

Your reply should strictly follow this format: Feedback:

Comparison:

Conclusion: A or B

Here is the data.

"""

video_a_path = "https://huggingface.co/datasets/tianleliphoebe/genai-arena-video-mp4/blob/main/aadd660eca2b4d4788195c729920121f.mp4"
video_b_path = "https://huggingface.co/datasets/tianleliphoebe/genai-arena-video-mp4/blob/main/86aa85eec53c4b1293ef632e6183fe93.mp4"
user_input = "A polar bear is playing guitar\n"

messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": system_prompt},
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "[User Input]\n"},
            {"type": "text", "text": user_input},
            {"type": "text", "text": "[The Start of Video A]\n"},
            {"type": "video", "url": video_a_path.replace("/blob/", "/resolve/")},
            {"type": "text", "text": "[The End of Video A]\n"},
            {"type": "text", "text": "[The Start of Video B]\n"},
            {"type": "video", "url": video_b_path.replace("/blob/", "/resolve/")},
            {"type": "text", "text": "[The End of Video B]\n"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    num_frames=8,
).to(model.device, dtype=torch.bfloat16)
generate_ids = model.generate(**inputs, max_new_tokens=1000)
decoded_output = processor.decode(
    generate_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(decoded_output)
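A quick way to confirm the dropped frame before calling generate (assuming the processor returns the video frames under "pixel_values", which the dtype cast in this repro suggests) is to check the frame count in the batch:

# With two 8-frame videos, 16 frames are expected along the first
# dimension; affected versions return 15 here.
print(inputs["pixel_values"].shape)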

Environment

transformers == 4.56.1
PyTorch == 2.8.0+cu126

Error traceback


MercurialBlade · Sep 16 '25 15:09

Thank you very much for reporting this. The bug has been fixed by this PR: https://github.com/huggingface/transformers/pull/41121

WesKwong · Sep 24 '25 09:09
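For readers pinned to an affected transformers version: the failure mode described above is a classic off-by-one when recovering per-video slices from cumulative frame counts. The sketch below is purely illustrative (hypothetical variable names, not the actual processing_internvl.py source or the PR diff); it only demonstrates how such a slip drops one frame from every video after the first:

import numpy as np

frames_per_video = [8, 8]
boundaries = np.cumsum(frames_per_video)      # [8, 16]

for i in range(len(frames_per_video)):
    # Buggy slicing: the "+ 1" skips the boundary frame, so every video
    # after the first loses one frame (8 + 7 = 15 features vs. 16 tokens).
    start = boundaries[i - 1] + 1 if i > 0 else 0
    end = boundaries[i]
    print(f"video {i}: {end - start} frames")  # video 0: 8, video 1: 7

# Corrected slicing drops the "+ 1":
#     start = boundaries[i - 1] if i > 0 else 0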