Does InternVL support multi-image interleaved conversations?
According to the demo code in the README, the images are all passed in the first round of the chat, and the image tokens are placed at the front of the question.
# Demo code in readme.
# multi-round multi-image conversation
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
question = "详细描述这两张图片" # Describe the two pictures in detail
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(question, response)
question = "这两张图片的相同点和区别分别是什么" # What are the similarities and differences between these two pictures
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(question, response)
# prompt looks like this:
# <|im_start|>system\n{system_message}<|im_end|><|im_start|>user\n<img>placeholder ... </img>\n{question}<|im_end|><|im_start|>assistant\n
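As a toy illustration of the prompt shape above (a minimal sketch, not the repo's code; `build_first_round_prompt` and the small `num_image_token` value are hypothetical, chosen only to keep the output short):

```python
# Hypothetical sketch of the first-round prompt assembly described above.
# `image_bs` is the number of image tiles; each tile is expanded into
# `num_image_token` context placeholders between <img> ... </img>.
def build_first_round_prompt(system_message, question, image_bs, num_image_token=2):
    image_tokens = '<img>' + '<IMG_CONTEXT>' * num_image_token * image_bs + '</img>'
    user_turn = image_tokens + '\n' + question
    return (f'<|im_start|>system\n{system_message}<|im_end|>'
            f'<|im_start|>user\n{user_turn}<|im_end|>'
            f'<|im_start|>assistant\n')

prompt = build_first_round_prompt('You are a helpful assistant.',
                                  'Describe the two pictures', image_bs=2)
print(prompt.count('<IMG_CONTEXT>'))  # 4: num_image_token * image_bs
```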
I want to know whether InternVL-Chat supports interleaved image-and-text conversations like DeepSpeed-VisualChat does. If so, how should the image tokens be inserted in each conversation round? An example would be appreciated.
# Does InternVL support something like this? (I know pixel_values should be passed,
# but I can't find any demo code that passes pixel_values in an interleaved text-and-image conversation)
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
question = "Describe this picture in detail"
response, history = model.chat(tokenizer, pixel_values1, question, generation_config, history=None, return_history=True)
print(question, response)
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
question = "Describe this picture in detail"
response, history = model.chat(tokenizer, pixel_values2, question, generation_config, history=history, return_history=True)
print(question, response)
question = "What is the difference between the two images?"
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(question, response)
model.chat only supports passing in new images when history is None:
def chat(self, tokenizer, pixel_values, question, generation_config, history=None, return_history=False,
         IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>', IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'):
    img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
    self.img_context_token_id = img_context_token_id
    if tokenizer.convert_tokens_to_ids('<|im_end|>') != 0:
        eos_token_id = tokenizer.convert_tokens_to_ids('<|im_end|>')  # 92542, InternLM2
    else:
        eos_token_id = tokenizer.eos_token_id

    from .conversation import get_conv_template

    template = get_conv_template(self.template)
    image_bs = pixel_values.shape[0]
    print(f'dynamic ViT batch size: {image_bs}')
    if history is None:
        history = []
        image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * image_bs + IMG_END_TOKEN
        question = image_tokens + '\n' + question
    else:
        for (old_question, old_answer) in history:
            template.append_message(template.roles[0], old_question)
            template.append_message(template.roles[1], old_answer)
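For what it's worth, the per-round insertion I have in mind could look roughly like this (a hypothetical helper, not InternVL code; the token names follow the excerpt above, and `wrap_question_with_images` is my own name):

```python
# Hypothetical sketch: wrap the CURRENT round's question with its own image
# token span, so earlier rounds' image tokens can simply stay in the history.
def wrap_question_with_images(question, image_bs, num_image_token=2,
                              img_start='<img>', img_ctx='<IMG_CONTEXT>',
                              img_end='</img>'):
    if image_bs == 0:  # text-only round: leave the question unchanged
        return question
    image_tokens = img_start + img_ctx * num_image_token * image_bs + img_end
    return image_tokens + '\n' + question

# Round 1 carries one image; round 2 is text-only.
q1 = wrap_question_with_images('Describe this picture.', image_bs=1)
q2 = wrap_question_with_images('What changed?', image_bs=0)
```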
You could wrap a generate method following the pattern of the chat method. Alternatively, you could try the swift framework: https://github.com/OpenGVLab/InternVL/issues/129
@hjh0119 The current code doesn't support the usage I described; as you said, it may place some restrictions on the input.
@czczup My question is whether InternVL-Chat is capable of interleaved image-text conversations, i.e., whether I can supply an image in any round (similar to the examples given by DeepSpeed-VisualChat), or whether images can currently only be inserted in the first round.
Interleaved image-text conversation is possible; you can refer to this.
@hjh0119
I looked at your code, and the prompt assembly seems to be the same as the InternVL demo: all images are placed in the first round's user turn, which is not quite what I mean by "interleaved". What I mean is what you do for deepseek-vl: the image tokens appear in each round's user turn, rather than being concentrated in the first round's user turn.
So I would still like to confirm with the InternVL authors what the correct way is to handle multi-round conversations with images.
@irexyc Do I understand correctly that by "interleaved" you mean each input can come with a new image? Like in this example:
<<< Describe this image.
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
This is a high-resolution image of a kitten. The kitten has striking blue eyes and a fluffy white and grey coat. The fur pattern suggests that it may be a Maine Coon or a similar breed. The kitten's ears are perked up, and it has a curious and innocent expression. The background is blurred, which brings the focus to the kitten's face.
--------------------------------------------------
<<< How many sheep are in the picture?
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
There are four sheep in the picture.
--------------------------------------------------
<<< What is the calculation result?
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png
The calculation result is 59,856.
@hjh0119
For InternVL: in your code the input looks interleaved, with a new image each time, but you are actually maintaining an image list, and the final prompt is still assembled by this function with all images placed in the very first user turn.
For deepseek-vl: you don't maintain an image_list; instead, the image embeddings are inserted according to <image_placeholder>, and <image_placeholder> appears in each round's user turn.
With the former, if a new round of the conversation contains an image, the historical prompt changes (the kv-cache can't be reused and must be recomputed). With the latter it doesn't change. I don't think these two approaches are the same.
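The kv-cache point can be shown with a toy string comparison (plain Python, not real model code; both prompt-building functions here are simplified stand-ins for the two schemes):

```python
IMG = '<image_placeholder>'

def prompt_images_first(history, n_images):
    # Scheme 1: all image placeholders are concentrated in the first user turn.
    return ''.join(
        f'user: {IMG * n_images}{q}\nassistant: {a}\n' if i == 0
        else f'user: {q}\nassistant: {a}\n'
        for i, (q, a) in enumerate(history))

def prompt_interleaved(history):
    # Scheme 2: each round carries its own placeholder (has_img flag per round).
    return ''.join(
        f'user: {IMG if has_img else ""}{q}\nassistant: {a}\n'
        for q, a, has_img in history)

# Scheme 1: adding an image in round 2 rewrites round 1's turn.
old1 = prompt_images_first([('q1', 'a1')], n_images=1)
new1 = prompt_images_first([('q1', 'a1'), ('q2', 'a2')], n_images=2)
print(new1.startswith(old1))  # False: old prompt is no longer a prefix

# Scheme 2: the old prompt stays a prefix of the new one.
old2 = prompt_interleaved([('q1', 'a1', True)])
new2 = prompt_interleaved([('q1', 'a1', True), ('q2', 'a2', True)])
print(new2.startswith(old2))  # True: kv-cache prefix can be reused
```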
I see; the main issue is how the image tokens from the history are handled. Indeed, the official code doesn't show a way to handle this.
Hi, InternVL 2.0 now supports interleaved image-text data. Feel free to try it out!