Does InternVL support multi-image interleaved conversations?
According to the demo code in the README, the images are all passed in the first round of the chat, and the image tokens are placed at the front of the question.
# Demo code in readme.
# multi-round multi-image conversation
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
question = "详细描述这两张图片" # Describe the two pictures in detail
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(question, response)
question = "这两张图片的相同点和区别分别是什么" # What are the similarities and differences between these two pictures
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(question, response)
# prompt looks like this:
# <|im_start|>system\n{system_message}<|im_end|><|im_start|>user\n<img>placeholder ... </img>\n{question}<|im_end|><|im_start|>assistant\n
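As a toy illustration of the prompt shape above (a minimal sketch, not the repo's code; `build_first_round_prompt` and the small `num_image_token` value are hypothetical, chosen only to keep the output short):

```python
# Hypothetical sketch of the first-round prompt assembly described above.
# `image_bs` is the number of image tiles; each tile is expanded into
# `num_image_token` context placeholders between <img> ... </img>.
def build_first_round_prompt(system_message, question, image_bs, num_image_token=2):
    image_tokens = '<img>' + '<IMG_CONTEXT>' * num_image_token * image_bs + '</img>'
    user_turn = image_tokens + '\n' + question
    return (f'<|im_start|>system\n{system_message}<|im_end|>'
            f'<|im_start|>user\n{user_turn}<|im_end|>'
            f'<|im_start|>assistant\n')

prompt = build_first_round_prompt('You are a helpful assistant.',
                                  'Describe the two pictures', image_bs=2)
print(prompt.count('<IMG_CONTEXT>'))  # 4: num_image_token * image_bs
```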
I want to know whether InternVL-Chat supports interleaved image-and-text conversations like DeepSpeed-VisualChat does. If so, how should the image tokens be inserted in each conversation round? An example would be appreciated.
# Does InternVL support something like this? (I know pixel_values should be passed,
# but I can't find any demo code that passes pixel_values in an interleaved text-and-image conversation)
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
question = "Describe this picture in detail"
response, history = model.chat(tokenizer, pixel_values1, question, generation_config, history=None, return_history=True)
print(question, response)
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
question = "Describe this picture in detail"
response, history = model.chat(tokenizer, pixel_values2, question, generation_config, history=history, return_history=True)
print(question, response)
question = "What is the difference between the two images?"
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(question, response)
model.chat only supports passing in new images when history is None:
def chat(self, tokenizer, pixel_values, question, generation_config, history=None, return_history=False,
         IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>', IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'):
    img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
    self.img_context_token_id = img_context_token_id
    if tokenizer.convert_tokens_to_ids('<|im_end|>') != 0:
        eos_token_id = tokenizer.convert_tokens_to_ids('<|im_end|>')  # 92542, InternLM2
    else:
        eos_token_id = tokenizer.eos_token_id

    from .conversation import get_conv_template

    template = get_conv_template(self.template)
    image_bs = pixel_values.shape[0]
    print(f'dynamic ViT batch size: {image_bs}')
    if history is None:
        history = []
        image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * image_bs + IMG_END_TOKEN
        question = image_tokens + '\n' + question
    else:
        for (old_question, old_answer) in history:
            template.append_message(template.roles[0], old_question)
            template.append_message(template.roles[1], old_answer)
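For what it's worth, the per-round insertion I have in mind could look roughly like this (a hypothetical helper, not InternVL code; the token names follow the excerpt above, and `wrap_question_with_images` is my own name):

```python
# Hypothetical sketch: wrap the CURRENT round's question with its own image
# token span, so earlier rounds' image tokens can simply stay in the history.
def wrap_question_with_images(question, image_bs, num_image_token=2,
                              img_start='<img>', img_ctx='<IMG_CONTEXT>',
                              img_end='</img>'):
    if image_bs == 0:  # text-only round: leave the question unchanged
        return question
    image_tokens = img_start + img_ctx * num_image_token * image_bs + img_end
    return image_tokens + '\n' + question

# Round 1 carries one image; round 2 is text-only.
q1 = wrap_question_with_images('Describe this picture.', image_bs=1)
q2 = wrap_question_with_images('What changed?', image_bs=0)
```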
You could wrap a generate method following the pattern of the chat method. Alternatively, you could try the swift framework: https://github.com/OpenGVLab/InternVL/issues/129
@hjh0119 The current code doesn't support the usage I described; as you said, it may place some restrictions on the input.
@czczup My question is whether InternVL-Chat is capable of interleaved image-text conversations, i.e., whether I can supply an image in any round (similar to the examples given by DeepSpeed-VisualChat), or whether images can currently only be inserted in the first round.
Interleaved image-text conversation is possible; you can refer to this.
@hjh0119
I looked at your code, and the prompt assembly seems to be the same as the InternVL demo: all images are placed in the first round's user turn, which is not quite what I mean by "interleaved". What I mean is what you do for deepseek-vl: the image tokens appear in each round's user turn, rather than being concentrated in the first round's user turn.
So I would still like to confirm with the InternVL authors what the correct way is to handle multi-round conversations with images.
@irexyc Do I understand correctly that by "interleaved" you mean each input can come with a new image? Like in this example:
<<< Describe this image.
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
This is a high-resolution image of a kitten. The kitten has striking blue eyes and a fluffy white and grey coat. The fur pattern suggests that it may be a Maine Coon or a similar breed. The kitten's ears are perked up, and it has a curious and innocent expression. The background is blurred, which brings the focus to the kitten's face.
--------------------------------------------------
<<< How many sheep are in the picture?
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
There are four sheep in the picture.
--------------------------------------------------
<<< What is the calculation result?
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png
The calculation result is 59,856.
@hjh0119
For InternVL: in your code the input looks interleaved, with a new image each time, but you are actually maintaining an image list, and the final prompt is still assembled by this function with all images placed in the very first user turn.
For deepseek-vl: you don't maintain an image_list; instead, the image embeddings are inserted according to <image_placeholder>, and <image_placeholder> appears in each round's user turn.
With the former, if a new round of the conversation contains an image, the historical prompt changes (the kv-cache can't be reused and must be recomputed). With the latter it doesn't change. I don't think these two approaches are the same.
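The kv-cache point can be shown with a toy string comparison (plain Python, not real model code; both prompt-building functions here are simplified stand-ins for the two schemes):

```python
IMG = '<image_placeholder>'

def prompt_images_first(history, n_images):
    # Scheme 1: all image placeholders are concentrated in the first user turn.
    return ''.join(
        f'user: {IMG * n_images}{q}\nassistant: {a}\n' if i == 0
        else f'user: {q}\nassistant: {a}\n'
        for i, (q, a) in enumerate(history))

def prompt_interleaved(history):
    # Scheme 2: each round carries its own placeholder (has_img flag per round).
    return ''.join(
        f'user: {IMG if has_img else ""}{q}\nassistant: {a}\n'
        for q, a, has_img in history)

# Scheme 1: adding an image in round 2 rewrites round 1's turn.
old1 = prompt_images_first([('q1', 'a1')], n_images=1)
new1 = prompt_images_first([('q1', 'a1'), ('q2', 'a2')], n_images=2)
print(new1.startswith(old1))  # False: old prompt is no longer a prefix

# Scheme 2: the old prompt stays a prefix of the new one.
old2 = prompt_interleaved([('q1', 'a1', True)])
new2 = prompt_interleaved([('q1', 'a1', True), ('q2', 'a2', True)])
print(new2.startswith(old2))  # True: kv-cache prefix can be reused
```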
I see; the main issue is how the image tokens from the history are handled. Indeed, the official code doesn't show a way to handle this.
Hi, InternVL 2.0 now supports interleaved image-text data. Feel free to try it out!