[Bug] All InternVL models do very poorly with multiple images (even just 3)
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
I'm using lmdeploy to serve OpenGVLab/InternVL2-Llama3-76B, and it works normally if given only a single image.
As soon as multiple images are given, OpenGVLab/InternVL2-Llama3-76B becomes blind to some of them, e.g. blind to one of three images even when asked directly about the one it cannot see.
I raise this issue because multi-image input was touted as a key strength of the InternVL2 models.
I've tried InternVL 1.5 and the InternVL2 26B, 40B, and 76B; all fail basic multi-image questions. There is no cherry-picking: these 3 images and the prompt come from our very first attempt to use multiple images with these models.
Reproduction
from openai import OpenAI
from PIL import Image
import base64
import requests
from io import BytesIO

client = OpenAI()  # add base_url etc. for the lmdeploy server


# The encoding function I linked previously - but we actually don't use this
# function in the API server
def encode_image_base64(image: Image.Image, format: str = 'JPEG') -> str:
    """Encode an image to base64 format."""
    buffered = BytesIO()
    if format == 'JPEG':
        image = image.convert('RGB')
    image.save(buffered, format)
    return base64.b64encode(buffered.getvalue()).decode('utf-8')


# load images from URLs and base64-encode the raw downloaded bytes
url1 = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"
url2 = "https://enterprise-h2ogpt-public-data.s3.amazonaws.com/receipt.jpg"
url3 = "https://enterprise-h2ogpt-public-data.s3.amazonaws.com/baby_cake.png"
urls = [url1, url2, url3]

base64s = []
for url in urls:
    response = requests.get(url)
    base64_correct = base64.b64encode(response.content).decode('utf-8')
    base64s.append(base64_correct)
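# Note: the helper above is shown only for reference; this repro base64-encodes
# the raw downloaded bytes directly. The equivalent route through PIL would be
# roughly the following (untested here, and it would re-encode the PNG as JPEG):
#   pil_images = [Image.open(BytesIO(requests.get(u).content)) for u in urls]
#   base64s = [encode_image_base64(im) for im in pil_images]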
prompt = """Pay attention and remember the information below. You will need to use only any chat history, any images given, or any document text in order to answer the question or imperative at the end.
<all_documents>
<doc>
<name>image_file_ac5589e7-92a3-470f-a933-40d6bad38052.pdf</name>
<page>1</page>
<text>
SHOPPING STORE
REG 12-21
CLERK 03:22 PM
2 618
1 MISC.
1 $0.49
STUFF $7.99
SUBTOTAL $8.48
TAX $0.74
TOTAL $9.22
CASH $10.00
CHANGE $0.78
NO REFUNDS
NO EXCHANGES
NO RETURNS
</text>
</doc>
<doc>
<name>image_file_764ae7bd-6b02-4ffb-b9d6-83e754c30952.pdf</name>
<page>1</page>
<text>
</text>
</doc>
<doc>
<name>image_file_1bfb88ea-a545-4b1f-a31f-051dbb90a378.pdf</name>
<page>1</page>
<text>
や Tmuodr.
</text>
</doc>
</all_documents>
According to only the information in any chat history, any images given, or any document text provided within the context above, give a well-structured response (that starts with "According to") to:
What do you see?
"""
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {
                    "url": 'data:image/jpeg;base64,' + base64s[0],
                },
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": 'data:image/jpeg;base64,' + base64s[1],
                },
            },
            {
                "type": "image_url",
                "image_url": {
                    # url3 is a PNG, so declare the matching media type
                    "url": 'data:image/png;base64,' + base64s[2],
                },
            },
        ],
    }
]
response = client.chat.completions.create(
    model="OpenGVLab/InternVL2-Llama3-76B",
    messages=messages,
    temperature=0.0,
    max_tokens=64,  # runs out of tokens otherwise
)
print(response.choices[0])
If I use max_tokens >= 128, this gives no response (and the server shows the error in "Error traceback" below):
Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='', role='assistant', function_call=None, tool_calls=None))
But with max_tokens=64 to avoid that error, the model becomes blind and says:
Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='According to the provided information, there is no mention of a tower in the chat history, images, or document text. Therefore, I cannot identify any tower based on the given context.', role='assistant', function_call=None, tool_calls=None))
This also happens with InternVL 1.5.
Environment
lmdeploy running in Docker.
How I built lmdeploy is described here: https://github.com/InternLM/lmdeploy/issues/2164
4x H100, etc. See the notes in the above issue and the related linked issue: https://github.com/InternLM/lmdeploy/issues/2163
Error traceback
FYI, if I use more than 64 tokens the server shows the log below. So even with an 8192 context length I can't send 3 images plus a tiny bit of text? Why is that?
2024-07-27 05:41:49,924 - lmdeploy - ERROR - Truncate max_new_tokens to 128
2024-07-27 05:41:49,924 - lmdeploy - ERROR - run out of tokens. session_id=152.
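If I understand InternVL2's preprocessing correctly, the budget is eaten by the image tiles themselves: each image is split into up to 12 dynamic tiles of 448x448 plus a thumbnail, and each tile costs roughly 256 visual tokens. A rough sketch (the per-tile cost and default tile count are my assumptions about the model, not values read out of lmdeploy):

# Back-of-the-envelope: why 3 images can exhaust an 8192-token context.
# Assumptions: up to 12 dynamic tiles + 1 thumbnail per image (InternVL2
# default preprocessing), ~256 visual tokens per 448x448 tile.
TOKENS_PER_TILE = 256
TILES_PER_IMAGE = 12 + 1  # max_num tiles plus the thumbnail

tokens_per_image = TILES_PER_IMAGE * TOKENS_PER_TILE  # 3328
print(3 * tokens_per_image)  # 9984 > 8192, before counting any prompt text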
You were right. But InternVL is not to blame.
Many methods I have tried that simply concatenate image tokens will sometimes fail in real multi-image scenarios.
The root cause could be a limitation of the model design, or insufficient multi-image data in both pretraining and SFT.
My two cents.
I'm playing a bit with InternVL2-2B. I fine-tuned it on a custom dataset with multiple images per prompt, using ModelScope with LoRA. What I noticed is that with an 8192 context length I can fit a maximum of 4-5 images when keeping max_num = 4. So I fine-tuned using max_num = 1, but the results are very bad for multiple images. There seems to be some problem with multi-image handling in the architecture.
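For reference, the rough math behind that limit, assuming ~256 visual tokens per 448x448 tile and one extra thumbnail tile per image (these numbers come from the InternVL2 reference preprocessing as I understand it, not from measurements):

# Hypothetical helper: how many images fit in a context for a given max_num.
def images_that_fit(context_len: int, max_num: int,
                    tokens_per_tile: int = 256, text_budget: int = 1024) -> int:
    # Each image costs (max_num tiles + 1 thumbnail) * tokens_per_tile tokens.
    tokens_per_image = (max_num + 1) * tokens_per_tile
    return (context_len - text_budget) // tokens_per_image

print(images_that_fit(8192, max_num=4))   # 5 -> roughly the 4-5 images I see
print(images_that_fit(8192, max_num=12))  # 2 with the default tiling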
Hi @rokopi-byte, what were the results of fine-tuning the InternVL model on your custom dataset?
Hi, thanks for your feedback. We are currently collecting multi-image data extensively, and we hope to significantly improve multi-image performance in the next release.
Great!