InternVL Unexpected output of InternVL 1.5 when using two images as input.

Open lbc12345 opened this issue 9 months ago • 3 comments

Hi, thanks for your brilliant work! When I use two images as input and ask the model to compare them, it outputs unexpected results such as:

Based on your instructions, I am to compare the two images. However, as I am an AI text model and I am not capable of viewing or comparing images....
It appears that there is no second image provided for comparison....
Without being able to view the images directly....
I'm sorry, I cannot compare the two images side by side as I can only process and describe the content of one image at a time....

However, sometimes it works correctly and outputs comparison results. May I ask what causes these problems?

May 02 '24 04:05 lbc12345

When using two images as input, you should concatenate the pixel_values like this:

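import torch

# `images` is assumed to be a list of image paths; `load_image` is the tiling helper from the InternVL example code.
pixel_values = []
for image in images:
    pixel_values.append(load_image(image, max_num=6).to(torch.bfloat16).cuda())

# Stack the tiles of all images along the first (batch) dimension.
pixel_values = torch.cat(pixel_values, dim=0)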
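
To make the effect of this concatenation concrete, here is a minimal sketch with dummy tensors. It assumes each image is preprocessed into a stack of tiles shaped (num_tiles, 3, 448, 448); the tile counts are made up, and `tile_counts` is a hypothetical bookkeeping list, not part of the InternVL API. It only illustrates that the concatenated tensor by itself does not record where one image ends and the next begins.

```python
import torch

# Stand-ins for the outputs of load_image(...): one stack of tiles per image.
# The tile counts (7 and 5) and the 448x448 resolution are illustrative assumptions.
tiles_a = torch.zeros(7, 3, 448, 448)
tiles_b = torch.zeros(5, 3, 448, 448)
per_image = [tiles_a, tiles_b]

# Hypothetical bookkeeping: remember how many tiles each image contributed.
tile_counts = [t.shape[0] for t in per_image]

# Same operation as in the snippet above: stack all tiles along dim 0.
pixel_values = torch.cat(per_image, dim=0)

print(pixel_values.shape)  # torch.Size([12, 3, 448, 448])
print(tile_counts)         # [7, 5]

# The per-image grouping can only be recovered with the saved counts.
recovered = torch.split(pixel_values, tile_counts, dim=0)
```

After `torch.cat`, the input is a single batch of 12 tiles, so any notion of "two images" has to come from somewhere other than the tensor shape.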

May 08 '24 08:05 hjh0119

Thanks for your reply. This is exactly what I'm using in my code. I also tried a more direct test: asking InternVL "How many images are there?" If I torch.cat two similar images, it answers "there is one image in the image." If I torch.cat many images, it may answer "two images", which is far from the correct answer.

May 09 '24 07:05 lbc12345

I also ran into this problem. When feeding multiple images together, the model couldn't seem to determine how many images there were.

May 10 '24 08:05 aabbc-cell

Because only single-image samples were used during training, the model's responses are not very stable when dealing with multiple images. We plan to collect some multi-image QA data to enhance this feature in the next release.

May 30 '24 14:05 czczup
