
batch inference, multi image per sample

Open paulpacaud opened this issue 3 months ago • 4 comments

Hi,

The documentation does not explain how to perform batch inference with multiple images per sample. It only covers the single-image-per-sample case:

# batch inference, single image per sample (单图批处理)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')

Is it possible to perform batch inference with multiple images per sample? If so, how?

Thanks

paulpacaud avatar Sep 08 '25 12:09 paulpacaud

We suggest you use LMDeploy for multi-image batch inference; you can refer to their documentation.
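For example, a minimal multi-image batch sketch with LMDeploy's pipeline API might look like this (untested here; the model name and image paths are placeholders):

from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL2_5-8B')  # placeholder model name

# each prompt is a (text, images) pair; a sample may carry several images
prompts = [
    ('Describe the image in detail.', [load_image('./examples/image1.jpg')]),
    ('Describe both images in detail.', [load_image('./examples/image2.jpg'),
                                         load_image('./examples/image3.jpg')]),
]
responses = pipe(prompts)
for resp in responses:
    print(resp.text)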

If you want to run inference with the transformers backend, you can refer to the following code:

# batch inference, multiple images per sample
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values3 = load_image('./examples/image3.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0) + pixel_values3.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2, pixel_values3), dim=0)

questions = ['<image>\nDescribe the image in detail.', '<image>\n<image>\nDescribe the image in detail.']
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')

Weiyun1025 avatar Sep 08 '25 13:09 Weiyun1025

What error are you getting, and how are you loading the model?

The documentation on Hugging Face may provide more clarity on batch inference when using Transformers.

jbuchananr avatar Sep 19 '25 22:09 jbuchananr

What error are you getting, and how are you loading the model?

The documentation on Hugging Face may provide more clarity on batch inference when using Transformers.

Regarding the image_token replacement logic in batch_chat: if a question contains multiple '<image>' placeholders, not all of them are actually replaced with image_tokens, are they? In chat(), by contrast, the replacement iterates over num_patches_list, replacing one placeholder per entry:

queries = []

for idx, num_patches in enumerate(num_patches_list):
    question = questions[idx]
    if pixel_values is not None and '<image>' not in question:
        question = '<image>\n' + question
    template = get_conv_template(self.template)
    template.system_message = self.system_message
    template.append_message(template.roles[0], question)
    template.append_message(template.roles[1], None)
    query = template.get_prompt()

    image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * num_patches + IMG_END_TOKEN
    # count=1: only the first '<image>' is replaced; any additional
    # placeholders in the same question remain as literal text
    query = query.replace('<image>', image_tokens, 1)
    queries.append(query)
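If I understand correctly, one way to fix this would be to accept a nested num_patches_list (one inner list of per-image patch counts per sample) and replace one placeholder per image, the way chat() iterates. A hypothetical patched loop, not the library's actual code:

queries = []

for idx, per_image_patches in enumerate(num_patches_list):
    # hypothetical: num_patches_list is nested, e.g. [[p1], [p2, p3]]
    question = questions[idx]
    if pixel_values is not None and '<image>' not in question:
        question = '<image>\n' * len(per_image_patches) + question
    template = get_conv_template(self.template)
    template.system_message = self.system_message
    template.append_message(template.roles[0], question)
    template.append_message(template.roles[1], None)
    query = template.get_prompt()

    # one replacement per image, each with its own patch count
    for num_patches in per_image_patches:
        image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * num_patches + IMG_END_TOKEN
        query = query.replace('<image>', image_tokens, 1)
    queries.append(query)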

shuhanyao avatar Sep 21 '25 13:09 shuhanyao

We suggest you use LMDeploy for multi-image batch inference; you can refer to their documentation.

If you want to run inference with the transformers backend, you can refer to the following code:

# batch inference, multiple images per sample
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values3 = load_image('./examples/image3.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0) + pixel_values3.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2, pixel_values3), dim=0)

questions = ['<image>\nDescribe the image in detail.', '<image>\n<image>\nDescribe the image in detail.']
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')

batch_chat() cannot correctly replace every '<image>' placeholder with image tokens, so this example does not work as written.
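Given the replace-once behavior, one untested workaround might be to use a single '<image>' placeholder per sample while keeping the summed patch count, so the one replacement emits context tokens for all of that sample's patches. Note that both images then share a single IMG_START/IMG_END span, which may differ from the model's training format:

# untested workaround sketch: one placeholder per sample,
# with that sample's per-image patch counts summed
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0) + pixel_values3.size(0)]
questions = ['<image>\nDescribe the image in detail.',
             '<image>\nDescribe both images in detail.']
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)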

shuhanyao avatar Sep 22 '25 03:09 shuhanyao