InternVL
Multi-image conversation does not work for more than 2 images?
Hello, I tried the multi-image conversation example from the README (https://github.com/OpenGVLab/InternVL/blob/764fdc9f3ee102bc6c2def02c2d0ca1e94336d06/README.md?plain=1#L627-L634). With the two-image example, I can reproduce the results shown at https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5#model-usage. However, when I try three or more images, the model seems to ignore all but the last two. For example, if I try:
# pixel_values1, pixel_values2, pixel_values3 correspond to images 1, 2, 3 in the examples folder
pixel_values = torch.cat((pixel_values1, pixel_values2, pixel_values3), dim=0)
question = "Can you construct a story from all the images?"
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(question, ': ', response)
and I see the following response:
Can you construct a story from all the images? : The story begins with a young girl named Lily who has always been fascinated by animals. She dreams of one day having a pet panda, just like the ones she sees in the zoo. One day, her wish comes true when she finds a baby panda in the forest. She takes the panda home and names it Panda.
As Panda grows, Lily realizes that she needs a friend for Panda. She decides to adopt a cat from the local animal shelter. She brings home a beautiful calico cat and names her Luna. Luna and Panda become the best of friends, spending their days playing and exploring.
One day, Lily decides to take Luna and Panda to the zoo to see the other animals. They visit the panda exhibit and see a majestic adult panda. Panda is amazed by the size and beauty of the adult panda and wishes to be as big and strong as it.
Luna, being the wise cat that she is, tells Panda that size and strength are not the most important things. She tells Panda that what truly matters is the love and friendship they share with Lily and each other.
The response never mentions the red panda from image 1. Is there anything I can do to run a multi-image conversation with more than two images, or is this a current limitation of the model?
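For debugging, here is a minimal sketch of one way to check that tiles from all three images end up in the concatenated batch. It assumes each `pixel_valuesN` tensor has its tiles along dimension 0, so `pixel_valuesN.shape[0]` is that image's tile count. The helper name `tile_ranges` and the tile counts below are hypothetical placeholders, not values from my run.

```python
from itertools import accumulate


def tile_ranges(tile_counts):
    """Return the (start, end) row range each image occupies after
    concatenating the per-image tensors along dimension 0."""
    ends = list(accumulate(tile_counts))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


# Hypothetical tile counts; in practice these would be
# pixel_values1.shape[0], pixel_values2.shape[0], pixel_values3.shape[0].
counts = [7, 13, 5]

for index, (start, end) in enumerate(tile_ranges(counts), start=1):
    print(f"image {index}: rows {start} to {end - 1}")

# The concatenated tensor should have this many rows along dimension 0.
print("expected total rows:", sum(counts))
```

If `pixel_values.shape[0]` matches the expected total, the tiles from every image are being passed in, which would suggest the problem lies in how the model distinguishes the images rather than in the concatenation itself.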