
batch inference, multi image per sample

Open paulpacaud opened this issue 3 months ago • 4 comments

Hi,

The documentation does not explain how to perform batch inference with multiple images per sample. It only covers the single-image-per-sample case:

# batch inference, single image per sample (单图批处理)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')

Is it possible to perform batch inference with multiple images per sample? If so, how?

Thanks

paulpacaud avatar Sep 08 '25 12:09 paulpacaud

We suggest you use LMDeploy for multi-image batch inference; you can refer to their documentation.
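For example, a minimal multi-image batch sketch with LMDeploy's pipeline API might look like this (untested here; the model name and image paths are placeholders):

from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL2_5-8B')  # placeholder model name

# each prompt is a (text, images) pair; a sample may carry several images
prompts = [
    ('Describe the image in detail.', [load_image('./examples/image1.jpg')]),
    ('Describe both images in detail.', [load_image('./examples/image2.jpg'),
                                         load_image('./examples/image3.jpg')]),
]
responses = pipe(prompts)
for resp in responses:
    print(resp.text)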

If you want to run inference with the transformers backend, you can refer to the following code:

# batch inference, multiple images per sample
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values3 = load_image('./examples/image3.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0) + pixel_values3.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2, pixel_values3), dim=0)

questions = ['<image>\nDescribe the image in detail.', '<image>\n<image>\nDescribe the image in detail.']
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')

Weiyun1025 avatar Sep 08 '25 13:09 Weiyun1025

What error are you getting, and how are you loading the model?

The documentation on Hugging Face may provide more clarity on batch inference when using Transformers.

jbuchananr avatar Sep 19 '25 22:09 jbuchananr

What error are you getting, and how are you loading the model?

The documentation on Hugging Face may provide more clarity on batch inference when using Transformers.

Regarding the image_token replacement logic in batch_chat: if a question contains multiple '<image>' placeholders, not all of them are actually replaced with image_tokens, are they? In chat(), by contrast, the replacement iterates over num_patches_list, replacing one placeholder per entry:

queries = []

for idx, num_patches in enumerate(num_patches_list):
    question = questions[idx]
    if pixel_values is not None and '<image>' not in question:
        question = '<image>\n' + question
    template = get_conv_template(self.template)
    template.system_message = self.system_message
    template.append_message(template.roles[0], question)
    template.append_message(template.roles[1], None)
    query = template.get_prompt()

    image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * num_patches + IMG_END_TOKEN
    # count=1: only the first '<image>' is replaced; any additional
    # placeholders in the same question remain as literal text
    query = query.replace('<image>', image_tokens, 1)
    queries.append(query)
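If I understand correctly, one way to fix this would be to accept a nested num_patches_list (one inner list of per-image patch counts per sample) and replace one placeholder per image, the way chat() iterates. A hypothetical patched loop, not the library's actual code:

queries = []

for idx, per_image_patches in enumerate(num_patches_list):
    # hypothetical: num_patches_list is nested, e.g. [[p1], [p2, p3]]
    question = questions[idx]
    if pixel_values is not None and '<image>' not in question:
        question = '<image>\n' * len(per_image_patches) + question
    template = get_conv_template(self.template)
    template.system_message = self.system_message
    template.append_message(template.roles[0], question)
    template.append_message(template.roles[1], None)
    query = template.get_prompt()

    # one replacement per image, each with its own patch count
    for num_patches in per_image_patches:
        image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * num_patches + IMG_END_TOKEN
        query = query.replace('<image>', image_tokens, 1)
    queries.append(query)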

shuhanyao avatar Sep 21 '25 13:09 shuhanyao

We suggest you use LMDeploy for multi-image batch inference; you can refer to their documentation.

If you want to run inference with the transformers backend, you can refer to the following code:

# batch inference, multiple images per sample
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values3 = load_image('./examples/image3.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0) + pixel_values3.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2, pixel_values3), dim=0)

questions = ['<image>\nDescribe the image in detail.', '<image>\n<image>\nDescribe the image in detail.']
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')

batch_chat() cannot correctly replace every '<image>' placeholder with image tokens, so this example does not work as written.
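Given the replace-once behavior, one untested workaround might be to use a single '<image>' placeholder per sample while keeping the summed patch count, so the one replacement emits context tokens for all of that sample's patches. Note that both images then share a single IMG_START/IMG_END span, which may differ from the model's training format:

# untested workaround sketch: one placeholder per sample,
# with that sample's per-image patch counts summed
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0) + pixel_values3.size(0)]
questions = ['<image>\nDescribe the image in detail.',
             '<image>\nDescribe both images in detail.']
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)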

shuhanyao avatar Sep 22 '25 03:09 shuhanyao