InternVL
Unexpected output of InternVL 1.5 when using two images as input.
Hi, thanks for your brilliant work! When I try to use two images as input and compare them, the model outputs unexpected results such as:
- Based on your instructions, I am to compare the two images. However, as I am an AI text model and I am not capable of viewing or comparing images....
- It appears that there is no second image provided for comparison....
- Without being able to view the images directly....
- I'm sorry, I cannot compare the two images side by side as I can only process and describe the content of one image at a time....

However, sometimes it works correctly and outputs comparison results. May I ask what causes these problems?
When using two images as input, you should concatenate the pixel_values like this:
import torch

# load_image: the image-tiling helper (as in the InternVL example code)
pixel_values = []
for image in images:
    pixel_values.append(load_image(image, max_num=6).to(torch.bfloat16).cuda())
pixel_values = torch.cat(pixel_values, dim=0)
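One thing worth keeping in mind with this approach: torch.cat along dim=0 stacks the tiles of every image into a single batch, so the resulting tensor by itself no longer records where one image ends and the next begins. As a rough sketch (not part of InternVL), the per-image tile counts can be recorded before concatenation and turned into row ranges, which at least makes it easy to check that both images really made it into the batch. The helper name `tile_slices` and the tile counts below are hypothetical, chosen only for illustration.

```python
from itertools import accumulate


def tile_slices(tile_counts):
    """Return the (start, end) row range of each image in the concatenated batch."""
    ends = list(accumulate(tile_counts))   # running total of tiles
    starts = [0] + ends[:-1]               # each image starts where the previous ended
    return list(zip(starts, ends))


# Hypothetical example values: image 1 produced 7 tiles, image 2 produced 5.
# With real tensors these would be collected as
#   [pv.shape[0] for pv in per_image_tensors]
# before calling torch.cat(per_image_tensors, dim=0).
tile_counts = [7, 5]
print(tile_slices(tile_counts))  # [(0, 7), (7, 12)]
```

If the last `end` value does not match pixel_values.shape[0] after concatenation, something went wrong while loading or tiling one of the images. This only verifies the input side; it does not change how the model interprets the batch.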
Thanks for your reply, this is exactly what I'm using in my code. I also tried a more direct test: asking InternVL "How many images are there?" If I torch.cat two similar images, it answers "there is one image in the image." If I torch.cat many images, it may answer that there are two images, which is far from the correct answer.
I also ran into this problem. When feeding multiple images together, the model couldn't seem to determine how many images there were.
Because only single-image samples were used during training, the model's responses are not very stable when dealing with multiple images. We plan to collect some multi-image QA data to enhance this feature in the next release.