Unexpected output of InternVL 1.5 when using two images as input.

Open lbc12345 opened this issue 9 months ago • 3 comments

Hi, thanks for your brilliant work! When I try to use two images as input and compare them, the model outputs unexpected results such as:

  1. Based on your instructions, I am to compare the two images. However, as I am an AI text model and I am not capable of viewing or comparing images....
  2. It appears that there is no second image provided for comparison....
  3. Without being able to view the images directly....
  4. I'm sorry, I cannot compare the two images side by side as I can only process and describe the content of one image at a time....

However, sometimes it works correctly and outputs comparison results. May I ask what causes these problems?

lbc12345, May 02 '24 04:05

When using two images as input, you should concatenate the pixel_values like this:

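import torch

# load_image is assumed to be the preprocessing helper from the InternVL
# example code; images is a list of image paths.
pixel_values = []
for image in images:
    pixel_values.append(load_image(image, max_num=6).to(torch.bfloat16).cuda())

# Concatenate the tiles of all images along the first (batch) dimension.
pixel_values = torch.cat(pixel_values, dim=0)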
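
Note that concatenating along dim 0 yields a single batch of tiles, with nothing in the tensor itself marking where one image ends and the next begins. For debugging on the caller's side, it may help to record how many tiles each image contributes before concatenation. The following is a minimal sketch, assuming each load_image result keeps its tiles on the first dimension; tile_ranges and the example counts are hypothetical and not part of InternVL, and these ranges are not something the model consumes.

```python
from itertools import accumulate


def tile_ranges(tile_counts):
    """Return (start, end) row ranges, one per image, for a batch built by
    concatenating per-image tile tensors along dim 0 (end is exclusive)."""
    ends = list(accumulate(tile_counts))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


# Hypothetical tile counts, e.g. [t.shape[0] for t in per_image_tensors]
tile_counts = [7, 5]
ranges = tile_ranges(tile_counts)
total_tiles = sum(tile_counts)

print(ranges)       # [(0, 7), (7, 12)]
print(total_tiles)  # 12
```

Here rows 0 to 6 of the concatenated tensor would belong to the first image and rows 7 to 11 to the second, which makes it easy to confirm that both images actually reached the model input.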

hjh0119, May 08 '24 08:05

Thanks for your reply, this is exactly what I'm using in my code. I also tried a more direct test: asking InternVL "How many images are there?" If I torch.cat two similar images, it answers "there is one image in the image." If I torch.cat many images, it may answer that there are two images, which is far from the correct answer.

lbc12345, May 09 '24 07:05

I also ran into this problem. When I feed multiple images together, the model can't seem to determine how many images there are.

aabbc-cell, May 10 '24 08:05

Because only single-image samples were used during training, the model's responses are not very stable when dealing with multiple images. We plan to collect some multi-image QA data to improve multi-image support in the next release.

czczup, May 30 '24 14:05