LLaVA-NeXT

Batch inference for LLaVA One Vision

Open JiayiGuo821 opened this issue 1 year ago • 8 comments

Is there a feasible way to conduct batch inference with LLaVA One Vision?

JiayiGuo821 avatar Aug 20 '24 14:08 JiayiGuo821

I think the most viable way is to use sglang's batch_run interface.

https://github.com/EvolvingLMMs-Lab/sglang/tree/dev/onevision_main

After launching the backend service, run this file:

https://github.com/EvolvingLMMs-Lab/sglang/blob/dev/onevision_main/examples/quick_start/srt_example_llava.py

You can use this early feature; see the PR:

https://github.com/sgl-project/sglang/pull/1123

We will update both our repo and the sglang side to make it more convenient for everyone to use.
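
For reference, here is a rough sketch of that flow using the sglang frontend's run_batch API (the launch command, flags, port, and file paths below are placeholders; see the fork above for the exact OneVision setup):

# Launch the backend first, e.g. (flags are placeholders):
#   python -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --port 30000

import sglang as sgl


@sgl.function
def image_qa(s, image_path, question):
    # One image and one question per sample; sglang batches across samples.
    s += sgl.user(sgl.image(image_path) + question)
    s += sgl.assistant(sgl.gen("answer"))


if __name__ == "__main__":
    sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

    # run_batch sends all samples to the backend in a single call.
    states = image_qa.run_batch(
        [
            {"image_path": "images/cat.jpeg", "question": "What is this?"},
            {"image_path": "images/dog.jpeg", "question": "What is this?"},
        ],
        max_new_tokens=128,
    )
    for state in states:
        print(state["answer"])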

Luodian avatar Aug 21 '24 02:08 Luodian

Hi @Luodian, have you integrated it with SGLang frontend inference? I'm guessing the image and video preprocessing isn't done correctly, because I can't get it to work:

import sglang as sgl


@sgl.function
def image_qa(s, image_path, question):
    s += sgl.user(sgl.image(image_path) + question)
    s += sgl.assistant(sgl.gen("answer"))


def single():
    state = image_qa.run(
        image_path="images/cat.jpeg", question="What is this?", max_new_tokens=128
    )
    print(state, "\n")

ehayeshaiper avatar Sep 02 '24 14:09 ehayeshaiper

This is an alternative solution that does not use SGLang. I started with the sample code from HF (https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov) and made two key changes.

First, after creating the model, you need to change the configuration from right padding to left padding:

tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)
model.config.tokenizer_padding_side = 'left'  # Use left padding for batch processing

Second, you need to create a list of strings showing the modality of each element in the batch. In my case I just used images. Then, pass the list to the model's generate method.

modalities = ["image" for _ in images]          # Repeat modality for every image in the batch

cont = model.generate(
    input_ids_repeated,
    images=image_tensor,
    image_sizes=image_sizes,
    modalities=modalities,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)

Here is the complete script:

from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

from PIL import Image
import copy
import torch
import os

import warnings

warnings.filterwarnings("ignore")
pretrained = "lmms-lab/llava-onevision-qwen2-7b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)
model.config.tokenizer_padding_side = 'left'  # Use left padding for batch processing

model.eval()

images_dir = "test-images"
image_files = [file for file in os.listdir(images_dir) if file.endswith('.jpg')]
images = []
for file in image_files:
    image_path = os.path.join(images_dir, file)
    images.append(Image.open(image_path))
image_tensor = process_images(images, image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]

conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models
question = DEFAULT_IMAGE_TOKEN + "\nProvided there is sufficient geographical and temporal information, provide a short description of the image based on the buildings, weather, objects and environment."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
input_ids_repeated = input_ids.repeat(len(images), 1)  # same prompt for every image, so no padding or attention mask is needed

image_sizes = [image.size for image in images]
modalities = ["image" for _ in images]          # Repeat modality for every image in the batch

cont = model.generate(
    input_ids_repeated,
    images=image_tensor,
    image_sizes=image_sizes,
    modalities=modalities,                   # Added this line with the modalities
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)

text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)

for file, output in zip(image_files, text_outputs):
    print(f"\n{file}: {output}")

dshatwell23 avatar Sep 18 '24 01:09 dshatwell23

@dshatwell23 Thanks for your template. Do we need to resize the images to the same dimensions within each batch?

HenryJunW avatar Sep 18 '24 01:09 HenryJunW

@HenryJunW When I tested it, all my images had different sizes. I just had to pass a list of tuples with the individual dimensions (image_sizes) to the generate method, and it handled them internally.
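
In other words, the images can keep their native resolutions; image_sizes is just the list of (width, height) tuples taken from the PIL images, e.g. (file names here are placeholders):

from PIL import Image

images = [Image.open("wide.jpg"), Image.open("tall.jpg")]  # different resolutions are fine
image_sizes = [img.size for img in images]                 # PIL .size is (width, height), e.g. [(1920, 1080), (768, 1024)]
# model.generate(..., image_sizes=image_sizes, ...) then uses the original sizes internally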

dshatwell23 avatar Sep 18 '24 01:09 dshatwell23

I am wondering how you perform the padding for the input tokens?

HaozheZhao avatar Nov 04 '24 09:11 HaozheZhao

Key considerations for batch inference:

- Left padding: set model.config.tokenizer_padding_side = 'left'.
- Attention mask: properly handle the attention_mask to account for padded tokens.
- Padding prompts: use torch.nn.utils.rnn.pad_sequence to pad the tokenized prompts to a common length.

The provided code snippet works effectively for my use case.

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle

from PIL import Image
import copy
import torch
import warnings

warnings.filterwarnings("ignore")

pretrained = "lmms-lab/llava-onevision-qwen2-7b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"

tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # add anything else you want to pass via llava_model_args here
model.config.tokenizer_padding_side = 'left'  # Use left padding for batch processing
model.eval()

prompts = ['prompt1', 'prompt2', 'prompt3']
images = [Image.open("1.jpg"), 
        Image.open("2.jpg"),
        Image.open("3.jpg")]

image_tensors = process_images(images, image_processor, model.config)
image_tensors = [_image.to(dtype=torch.float16, device=device) for _image in image_tensors]

conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models

questions = [f'{DEFAULT_IMAGE_TOKEN}\n_your_template: [{prompt}]' for prompt in prompts]  # "_your_template: [...]" is a placeholder; substitute your own prompt format
prompt_questions = []
for qu in questions:
    conv = copy.deepcopy(conv_templates[conv_template])
    conv.append_message(conv.roles[0], qu)
    conv.append_message(conv.roles[1], None)
    prompt_questions.append(conv.get_prompt())

input_ids = [tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt") for prompt_question in prompt_questions]
input_ids = torch.nn.utils.rnn.pad_sequence(input_ids, batch_first=True, padding_value=tokenizer.pad_token_id).to(device)  # pad the prompts to a common length

attention_mask = (input_ids != tokenizer.pad_token_id).to(dtype=torch.float16)  # mark the padded positions so they are ignored
image_sizes = [image.size for image in images]
modalities = ["image" for _ in images]
cont = model.generate(
    input_ids,
    images=image_tensors,
    image_sizes=image_sizes,
    modalities=modalities,
    attention_mask=attention_mask,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)

yifan123 avatar Mar 12 '25 08:03 yifan123

How about performing inference with the same question but on different batches of images?

khoa16122004 avatar Mar 30 '25 03:03 khoa16122004
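
For that last case, here is a minimal sketch (untested; it reuses the model, tokenizer, conv_template, helper functions, and the list of PIL images loaded in the scripts above, and the batch size and question text are placeholders): tokenize the shared question once, then repeat it for each chunk of images, so no padding or attention mask is needed.

question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").to(device)

batch_size = 8  # placeholder; tune to your GPU memory
for start in range(0, len(images), batch_size):
    batch = images[start:start + batch_size]
    image_tensors = process_images(batch, image_processor, model.config)
    image_tensors = [t.to(dtype=torch.float16, device=device) for t in image_tensors]
    out = model.generate(
        input_ids.unsqueeze(0).repeat(len(batch), 1),  # same prompt for every image in the chunk
        images=image_tensors,
        image_sizes=[img.size for img in batch],
        modalities=["image"] * len(batch),
        do_sample=False,
        temperature=0,
        max_new_tokens=512,
    )
    print(tokenizer.batch_decode(out, skip_special_tokens=True))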