Phi-3-Vision Batch Inference Prompt format
This issue is for a: (mark with an x)
- [ ] bug report -> please search issues before submitting
- [x] feature request
- [x] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)
I can see that this feature request ticket has been marked as completed. Does this mean Phi-3-vision supports batch inference now? If yes, can you please provide documentation? I was not able to find instructions/docs on how to do batch inference with Phi-3-vision, especially what the prompt format should be. I tried replicating the single-image prompt format, but the Processor() doesn't work with a list of prompts.
Expected/desired behavior
Can do batch inference: [{img1, prompt1}, {img2, prompt2}, ...] -> output: [{response1}, {response2}, ...]
The model does support batching; however, the processor doesn't, so you have to write your own function to handle batch inference. For example, here is the code I use with LitServe.
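For context, each element of `input` that reaches the `batch()` hook below is just the processor output for one (image, prompt) pair, prepared the same way as for single-image inference. A minimal sketch (the `preprocess_one` helper is made up for illustration; the `<|image_1|>` placeholder and chat template follow the Phi-3-vision model card):

```python
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "microsoft/Phi-3-vision-128k-instruct", trust_remote_code=True
)

def preprocess_one(image: Image.Image, prompt: str):
    # Single-image prompt format: <|image_1|> marks where the image is inserted.
    messages = [{"role": "user", "content": f"<|image_1|>\n{prompt}"}]
    text = processor.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    # One sample at a time; the batching happens later in batch().
    return processor(text, [image], return_tensors="pt")
```

The padding and batching helpers then combine these per-sample outputs: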
import torch  # needed for torch.cat() in batch() below


def pad_sequence(self, sequences, padding_side='right', padding_value=0):
    """
    Pad a list of sequences to the same length.
    sequences: list of tensors in [seq_len, *] shape
    """
    assert padding_side in ['right', 'left']
    max_size = sequences[0].size()
    trailing_dims = max_size[1:]
    max_len = max(len(seq) for seq in sequences)
    batch_size = len(sequences)
    output = sequences[0].new_full((batch_size, max_len) + trailing_dims, padding_value)
    for i, seq in enumerate(sequences):
        length = seq.size(0)
        if padding_side == 'right':
            output.data[i, :length] = seq
        else:
            output.data[i, -length:] = seq
    return output
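For instance, padding two token-id tensors of different lengths gives a [batch, max_len] tensor (a standalone check added for illustration, not part of the server code; since `self` is unused by the helper, None is passed in its place):

```python
import torch

a = torch.tensor([1, 2, 3])        # 3 tokens
b = torch.tensor([4, 5, 6, 7, 8])  # 5 tokens
padded = pad_sequence(None, [a, b], padding_side='right', padding_value=0)
# padded == tensor([[1, 2, 3, 0, 0],
#                   [4, 5, 6, 7, 8]])
```

The `batch()` hook then uses this helper on the tokenized prompts: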
def batch(self, input):
    batched_input_id = []
    batched_pixel_values = []
    batched_image_sizes = []
    for inp in input:
        # Each `inp` is a single-sample processor output (batch dimension of 1).
        batched_input_id.append(inp["input_ids"].squeeze(0))
        batched_pixel_values.append(inp["pixel_values"])
        batched_image_sizes.append(inp["image_sizes"])
    # Pad the token ids to a common length; padded positions are masked out
    # via the attention mask.
    input_ids = self.pad_sequence(batched_input_id, padding_side='right', padding_value=self.model.pad_token_id)
    attention_mask = input_ids != self.model.pad_token_id
    # Stack the image tensors along the batch dimension.
    pixel_values = torch.cat(batched_pixel_values, dim=0)
    image_sizes = torch.cat(batched_image_sizes, dim=0)
    batched_input = {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "pixel_values": pixel_values,
        "image_sizes": image_sizes
    }
    return batched_input
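Putting it together, the per-sample processor outputs are collected, batched, and moved to the model's device before generation (a sketch; `requests`, `preprocess_one`, and the device handling are my assumptions, not a fixed API):

```python
# `requests` is assumed to be a list of (image, prompt) pairs.
per_sample = [preprocess_one(img, prompt) for img, prompt in requests]
inputs = self.batch(per_sample)
# Move every tensor onto the model's device before calling generate().
inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
```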
The unbatching would look like this:
generated_ids = self.model.generate(**inputs, eos_token_id=self.processor.tokenizer.eos_token_id, **generation_args)
generated_ids = generated_ids[:, inputs["input_ids"].shape[1] :]
response = self.processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
The response is a list of generated strings, one per request, in the same order as the batched inputs.