
[Bug] internvl2.5-8B: after multiple conversation rounds, once the accumulated history grows long enough, the model can no longer answer questions about a new image

Open · moonlightian opened this issue 8 months ago · 1 comment

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

After several rounds of conversation about image 1, feeding in image 2 and asking about it produces answers that still describe image 1. Alternatively: first ask about image 1 with return_history=True; if that turn's output is long enough, passing the accumulated history into a follow-up question about image 2 yields a response related only to image 1 and not to image 2.

Reproduction

import torch
from transformers import AutoModel, AutoTokenizer
# load_image is the tiling helper from the InternVL model card examples;
# path points to the checkpoint, e.g. a local dir or the Hub model id.

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

pixel_values = load_image('./example.png', max_num=12).to(torch.float16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

question = '<image>\n详细描述一下这副图片.'  # "Describe this image in detail."
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = '根据这幅图片写一首诗。'  # "Write a poem based on this image."
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# ... (further conversation rounds about image 1 elided) ...

pixel_values2 = load_image('./rabbit.jpg', max_num=12).to(torch.float16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

question = '<image>\n这副图片描述了什么'  # "What does this image depict?"
response = model.chat(tokenizer, pixel_values2, question, generation_config, history=history)
print(f'User: {question}\nAssistant: {response}')

Environment

transformers 4.45.2
torch 2.0.1

Error traceback


moonlightian · Apr 29 '25 11:04

I have the same issue. On my first attempt with an image, I was able to get an answer from model.chat(). However, when I tried to reuse the history, I got an empty error traceback. So I added a lot of debug messages and finally got a result:

Analysis of the traceback:

  • Origin: The error occurs within the self.model.chat() call, specifically when it internally calls self.generate().
  • Location: The failure point is inside the modeling_internvl_chat.py script (part of the InternVL model code loaded from the Hugging Face cache) at line 333.
  • Error: AssertionError triggered by the line assert selected.sum() != 0.

Interpretation:
  This AssertionError suggests that, during text generation with history provided, a selection mechanism in the model's generation code ends up selecting zero items (selected.sum() == 0). This could relate to any of the following (a sketch of the likely mechanism follows the list):

  • Attention masking: how the model handles attention between the text history, the new query, and the image features.
    When history is added, the attention patterns change, and perhaps some internal masking incorrectly leads to a state
    where no tokens are attended to or selected for the next step.

  • Token type IDs / embeddings: how the model combines text embeddings from the history/query with image embeddings.
    There might be an incompatibility or indexing issue when history is present alongside images.

  • Internal state: the generation process maintains internal state (such as past key-values for attention).
    Adding history might modify this state in a way that conflicts with how image features are incorporated
    in subsequent turns, leading to this assertion failure.
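
For reference, here is a minimal sketch of what that assertion most plausibly guards, based on how InternVL-style chat code splices vision features into the token sequence. The function name, argument names, and img_context_token_id are assumptions for illustration, not code copied from modeling_internvl_chat.py:

import torch

# Hypothetical sketch of the embedding-splice step that a failing
# `assert selected.sum() != 0` would protect (names are assumed).
def splice_image_features(input_embeds, input_ids, vit_embeds, img_context_token_id):
    # Mark every position holding the special image-context placeholder token.
    selected = (input_ids == img_context_token_id)  # bool mask over the sequence
    # If the prompt built from the history plus the new query contains no
    # placeholder positions -- or the placeholders and pixel_values fell out
    # of sync across turns -- nothing is selected and the assertion fires.
    assert selected.sum() != 0
    # Overwrite the placeholder embeddings with the vision features.
    input_embeds[selected] = vit_embeds.reshape(-1, vit_embeds.shape[-1])
    return input_embeds

Under that reading, a long history whose image placeholders no longer line up with the pixel_values passed for the current turn would explain both the assertion failure and the image-1-only answers described above.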

UcanYusuf · May 01 '25 14:05

Hi, for multi-image multi-round conversation, you should concatenate the pixel_values and build the num_patches_list outside the model.chat() function call. An example can be found here. For the example above, the correct usage is:

import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

pixel_values1 = load_image('./example.png', max_num=12).to(torch.float16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

question = '<image>\n详细描述一下这副图片.'  # "Describe this image in detail."
response, history = model.chat(tokenizer, pixel_values1, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = '根据这幅图片写一首诗。'  # "Write a poem based on this image."
response, history = model.chat(tokenizer, pixel_values1, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# ... (further conversation rounds about image 1 elided) ...

pixel_values2 = load_image('./rabbit.jpg', max_num=12).to(torch.float16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

# Concatenate the patch tensors of both images along the batch dimension,
# and record how many patches belong to each image.
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = '<image>\n这副图片描述了什么'  # "What does this image depict?"
response = model.chat(tokenizer, pixel_values, question, generation_config, history=history, num_patches_list=num_patches_list)
print(f'User: {question}\nAssistant: {response}')
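
To generalize this pattern to any number of images, the bookkeeping can be wrapped in a small helper. This is only a sketch built on the snippet above; build_multi_image_inputs is a hypothetical name, not part of the InternVL API:

import torch

def build_multi_image_inputs(pixel_values_per_image):
    # pixel_values_per_image: list of tensors, each [num_patches, C, H, W],
    # one entry per <image> tag that appears in the conversation so far.
    # (Hypothetical helper; not part of the InternVL API.)
    pixel_values = torch.cat(pixel_values_per_image, dim=0)
    num_patches_list = [pv.size(0) for pv in pixel_values_per_image]
    return pixel_values, num_patches_list

# Usage with the variables from the snippet above:
# pixel_values, num_patches_list = build_multi_image_inputs([pixel_values1, pixel_values2])
# response = model.chat(tokenizer, pixel_values, question, generation_config,
#                       history=history, num_patches_list=num_patches_list)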

Ganlin-Yang · May 14 '25 06:05

> num_patches_list=num_patches_list

Hi, thank you for your reply; that works well for me. So whenever a new image is input during a multi-round conversation, it's necessary to concatenate the pixel values and extend the num_patches_list accordingly, right?

moonlightian · May 22 '25 06:05