
Phi-3-Vision Batch Inference Prompt format

nzarif opened this issue 1 year ago · 1 comment

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [x] feature request
- [x] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

I can see that this feature request ticket has been marked as completed. Does this mean Phi-3-vision now supports batch inference? If yes, can you please provide documentation? I was not able to find instructions/docs on how to do batch inference with Phi-3-vision, especially what the prompt format should be. I tried replicating the single-image prompt format, but the Processor() doesn't work with a list of prompts.

Expected/desired behavior

Can do batch inference: input [ {img1, prompt1}, {img2, prompt2}, ... ], output [ {response1}, {response2}, ... ]

nzarif · Sep 26 '24

The model does support batching; however, the processor doesn't, so you should write a function to handle batch inference. For example, here is my code when using LitServe:

    def pad_sequence(self, sequences, padding_side='right', padding_value=0):
        """
        Pad a list of sequences to the same length.
        sequences: list of tensors in [seq_len, *] shape
        """
        assert padding_side in ['right', 'left']
        max_size = sequences[0].size()
        trailing_dims = max_size[1:]
        max_len = max(len(seq) for seq in sequences)
        batch_size = len(sequences)
        output = sequences[0].new_full((batch_size, max_len) + trailing_dims, padding_value)
        for i, seq in enumerate(sequences):
            length = seq.size(0)
            if padding_side == 'right':
                output.data[i, :length] = seq
            else:
                output.data[i, -length:] = seq
        return output
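    # Illustration (example values are my own, not from the original post): for two
    # input_id tensors of lengths 5 and 3 with padding_side='right' and padding_value=0,
    # this returns a [2, 5] tensor in which the shorter row is right-padded with zeros:
    #   tensor([[200, 201, 202, 203, 204],
    #           [200, 201, 202,   0,   0]])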

    def batch(self, input):
        """Collate a list of per-sample processor outputs into one padded batch."""
        batched_input_id = []
        batched_pixel_values = []
        batched_image_sizes = []
        
        for inp in input:
            batched_input_id.append(inp["input_ids"].squeeze(0))
            batched_pixel_values.append(inp["pixel_values"])
            batched_image_sizes.append(inp["image_sizes"])

        # Pad input_ids to a common length, derive the attention mask from the pad
        # token, and concatenate the image tensors along the batch dimension.
        input_ids = self.pad_sequence(batched_input_id, padding_side='right', padding_value=self.model.pad_token_id)
        attention_mask = input_ids != self.model.pad_token_id
        pixel_values = torch.cat(batched_pixel_values, dim=0)
        image_sizes = torch.cat(batched_image_sizes, dim=0)

        batched_input = {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "pixel_values": pixel_values,
            "image_sizes": image_sizes
        }
        
        return batched_input
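
For context (not from the original post, just a sketch of how the `input` list passed to `batch()` can be produced), each image/prompt pair is run through the processor individually, using the single-image prompt format from the Phi-3-vision model card; the file names and questions below are placeholders:

    from PIL import Image
    from transformers import AutoProcessor

    processor = AutoProcessor.from_pretrained(
        "microsoft/Phi-3-vision-128k-instruct", trust_remote_code=True
    )

    # Placeholder images and questions.
    images = [Image.open("img1.jpg"), Image.open("img2.jpg")]
    questions = ["Describe this image.", "What objects are visible?"]

    per_sample_inputs = []
    for image, question in zip(images, questions):
        # Single-image prompt format: one <|image_1|> tag per prompt.
        messages = [{"role": "user", "content": f"<|image_1|>\n{question}"}]
        prompt = processor.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        # One processor call per (prompt, image) pair; each call returns the
        # input_ids, pixel_values and image_sizes for that sample.
        per_sample_inputs.append(processor(prompt, [image], return_tensors="pt"))

    # per_sample_inputs is the list that batch() above collates into a single padded batch.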

The unbatching would look like this:

    generated_ids = self.model.generate(**inputs, eos_token_id=self.processor.tokenizer.eos_token_id, **generation_args)
    generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
    response = self.processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

The response will be a list of generated texts, one per input sample.
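
The `generation_args` referenced above are not shown in the snippet; a typical set for greedy decoding (my assumption, adjust as needed) would be:

    generation_args = {
        "max_new_tokens": 512,
        "temperature": 0.0,
        "do_sample": False,
    }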

2U1 · Oct 02 '24