LLaVA-NeXT
Batch inference for LLaVA-OneVision
Is there a feasible way to conduct batch inference with LLaVA-OneVision?
I think the most viable way is to use SGLang's batch-run interface (run_batch).
https://github.com/EvolvingLMMs-Lab/sglang/tree/dev/onevision_main
After launching the backend service, run this file:
https://github.com/EvolvingLMMs-Lab/sglang/blob/dev/onevision_main/examples/quick_start/srt_example_llava.py
You can try this early feature; see the PR:
https://github.com/sgl-project/sglang/pull/1123
We will update both our repo and the SGLang side to make it more convenient for everyone to use.
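For reference, here is a rough sketch of what batch inference looks like with the SGLang frontend. The model path, port, and launch command are assumptions (the onevision branch may need extra flags), so check its README for the exact invocation:
# 1) Launch the backend, e.g.:
#    python -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --port 30000
# 2) Run a batch of programs against it:
import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def image_qa(s, image_path, question):
    s += sgl.user(sgl.image(image_path) + question)
    s += sgl.assistant(sgl.gen("answer"))

# run_batch executes the same program over a list of argument dicts.
states = image_qa.run_batch(
    [
        {"image_path": "images/cat.jpeg", "question": "What is this?"},
        {"image_path": "images/dog.jpeg", "question": "What is this?"},
    ],
    max_new_tokens=128,
)
for state in states:
    print(state["answer"])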
Hi @Luodian, have you integrated it with SGLang frontend inference? I'm guessing the image and video preprocessing isn't done correctly, because I can't get it to work:
import sglang as sgl

@sgl.function
def image_qa(s, image_path, question):
    s += sgl.user(sgl.image(image_path) + question)
    s += sgl.assistant(sgl.gen("answer"))

def single():
    state = image_qa.run(
        image_path="images/cat.jpeg", question="What is this?", max_new_tokens=128
    )
    print(state, "\n")
This is an alternative solution without using SGLang. I started with the sample code from HF (https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov) and made two key changes.
First, after creating the model, you need to change the configuration from right padding to left padding:
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)
model.config.tokenizer_padding_side = 'left' # Use left padding for batch processing
Second, you need to create a list of strings giving the modality of each element in the batch; in my case I just used images. Then pass the list to the model's generate method:
modalities = ["image" for _ in images] # Repeat modality for every image in the batch
cont = model.generate(
    input_ids_repeated,
    images=image_tensor,
    image_sizes=image_sizes,
    modalities=modalities,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
Here is the complete script:
from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from PIL import Image
import copy
import torch
import os
import warnings
warnings.filterwarnings("ignore")
pretrained = "lmms-lab/llava-onevision-qwen2-7b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)
model.config.tokenizer_padding_side = 'left' # Use left padding for batch processing
model.eval()
images_dir = "test-images"
image_files = [file for file in os.listdir(images_dir) if file.endswith('.jpg')]
images = []
for file in image_files:
    image_path = os.path.join(images_dir, file)
    images.append(Image.open(image_path))
image_tensor = process_images(images, image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]
conv_template = "qwen_1_5" # Make sure you use correct chat template for different models
question = DEFAULT_IMAGE_TOKEN + "\nProvided there is sufficient geographical and temporal information, provide a short description of the image based on the buildings, weather, objects and environment."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
input_ids_repeated = input_ids.repeat(len(images), 1)
image_sizes = [image.size for image in images]
modalities = ["image" for _ in images] # Repeat modality for every image in the batch
cont = model.generate(
    input_ids_repeated,
    images=image_tensor,
    image_sizes=image_sizes,
    modalities=modalities,  # Added this line with the modalities
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
for file, output in zip(image_files, text_outputs):
    print(f"\n{file}: {output}")
@dshatwell23 Thanks for your template. Do we need to resize the images to the same dimensions within each batch?
@HenryJunW When I tested it, all my images had different sizes; I just had to pass a list of tuples with the individual dimensions (image_sizes) to the generate method, and it handled them internally.
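For illustration, here is a minimal sketch (the file names are made up, and image_processor, model, and device come from the setup above) of how differently sized images go into one batch: process_images preprocesses each image individually, and image_sizes just records each original (width, height) from PIL.
from PIL import Image
from llava.mm_utils import process_images
# Hypothetical files with different resolutions; no manual resizing is needed.
images = [Image.open("wide_1920x1080.jpg"), Image.open("square_512x512.jpg")]
image_tensor = process_images(images, image_processor, model.config)
image_tensor = [t.to(dtype=torch.float16, device=device) for t in image_tensor]
# PIL's .size is (width, height); generate() uses these sizes internally.
image_sizes = [img.size for img in images]  # e.g. [(1920, 1080), (512, 512)]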
I am wondering how you perform the padding for the input tokens.
Key considerations for batch inference:
Left padding: set model.config.tokenizer_padding_side = 'left' (as in the scripts above).
Attention mask: properly handle the attention_mask to account for padded tokens.
Padding prompts: use torch.nn.utils.rnn.pad_sequence to pad the input_ids to a common length.
The following code snippet works well for my use case:
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import copy
import torch
import warnings
warnings.filterwarnings("ignore")
pretrained = "lmms-lab/llava-onevision-qwen2-7b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map) # Add any other thing you want to pass in llava_model_args
model.config.tokenizer_padding_side = 'left' # Use left padding for batch processing
model.eval()
prompts = ['prompt1', 'prompt2', 'prompt3']
images = [Image.open("1.jpg"), Image.open("2.jpg"), Image.open("3.jpg")]
image_tensors = process_images(images, image_processor, model.config)
image_tensors = [_image.to(dtype=torch.float16, device=device) for _image in image_tensors]
conv_template = "qwen_1_5" # Make sure you use correct chat template for different models
questions = [f'<image>\n_your_template: [{prompt}]' for prompt in prompts]
prompt_questions = []
for qu in questions:
    conv = copy.deepcopy(conv_templates[conv_template])
    conv.append_message(conv.roles[0], qu)
    conv.append_message(conv.roles[1], None)
    prompt_questions.append(conv.get_prompt())
input_ids = [tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt") for prompt_question in prompt_questions]
input_ids = torch.nn.utils.rnn.pad_sequence(input_ids, batch_first=True, padding_value=tokenizer.pad_token_id).to(device)
attention_mask = (input_ids != tokenizer.pad_token_id).to(dtype=torch.float16)
image_sizes = [image.size for image in images]
modalities = ["image" for _ in images]
cont = model.generate(
    input_ids,
    images=image_tensors,
    image_sizes=image_sizes,
    modalities=modalities,
    attention_mask=attention_mask,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)
How about performing inference with the same question but on different batches of images?
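If it helps, here is a minimal sketch of one way to do that, reusing the setup from the scripts above (model, tokenizer, image_processor, device, and a prompt_question built with the qwen_1_5 template); the helper name and batch size are just illustrative:
def generate_for_batches(all_images, prompt_question, batch_size=8):
    # Same question for every image: one tokenized prompt is repeated, so no padding or attention mask is needed.
    base_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
    outputs = []
    for i in range(0, len(all_images), batch_size):
        chunk = all_images[i:i + batch_size]
        image_tensor = process_images(chunk, image_processor, model.config)
        image_tensor = [t.to(dtype=torch.float16, device=device) for t in image_tensor]
        cont = model.generate(
            base_ids.repeat(len(chunk), 1),
            images=image_tensor,
            image_sizes=[img.size for img in chunk],
            modalities=["image"] * len(chunk),
            do_sample=False,
            temperature=0,
            max_new_tokens=4096,
        )
        outputs.extend(tokenizer.batch_decode(cont, skip_special_tokens=True))
    return outputs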