
Can I ask more than 1 question simultaneously through the blip_vqa model?

SKBL5694 opened this issue 1 year ago · 5 comments

I know how to ask the same question about multiple images at the same time, and the model returns a different result for each image. How do I do the reverse? That is: can I ask multiple questions about the same image and get multiple different answers simultaneously (running the model only once)? If so, how can I do it? I tried feeding multiple questions as a list into the blip_vqa model, but it raised an error that looked like a tensor dimension mismatch. Thank you for your excellent work; I look forward to your reply.

SKBL5694 · Jul 06 '22 06:07
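(For context, the failing attempt was presumably something like the sketch below; this is a reconstruction, not code from the thread, and it assumes the BLIP demo notebook's load_demo_image, model_vqa, and device. The blip_vqa forward pass tokenizes the whole question list but encodes only the single image, so the cross-attention batch sizes disagree and PyTorch raises a size-mismatch error.)

# Hypothetical reconstruction of the failing call (not from the thread),
# assuming the BLIP demo notebook context: load_demo_image, model_vqa, device.
image = load_demo_image(image_size=480, device=device)       # image batch of 1
questions = ['where is the dog?', 'what is the dog doing?']  # question batch of 2
# Raises a tensor size mismatch: the text encoder cross-attends to an image
# batch of 1 while the question batch is 2.
answers = model_vqa(image, questions, train=False, inference='generate')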

Hey, you can ask them one by one, something like this:

# assumes the BLIP demo notebook context: load_demo_image, model_vqa, device
import torch
import pandas as pd

def get_answer(img_url, questions):
    image = load_demo_image(img_url, image_size=400, device=device)
    answers = {}
    with torch.no_grad():
        # beam search, one question at a time
        for question in questions:
            answer = model_vqa(image, question, train=False, inference='generate')
            answers[question] = answer
    # collect question/answer pairs into a two-column table
    df = pd.DataFrame(answers).T.reset_index()
    df.columns = ['Questions', 'Answers']
    return df

questions = [question1, question2, question3, ...]
answers = get_answer(img_url, questions)

laxmimerit · Jul 06 '22 08:07

@laxmimerit Thanks for the reply. I am already using something like your code, but with this method the model must encode the same image several times (once per question). What I want to know is whether there is a way to encode the image only once.

SKBL5694 · Jul 06 '22 08:07

It should be possible, if the GIF is accurate.

[GIF attachment]

woctezuma · Jul 06 '22 14:07

You can encode the image once, then repeat the resulting image embedding along the batch dimension (once per question) and feed the questions through as a single batch.

LiJunnan1992 · Jul 12 '22 07:07
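(A minimal sketch of that batching idea, using the Hugging Face port of BLIP rather than this repo; the helper name ask_batched is made up for illustration. The image is encoded exactly once, the embedding is tiled along the batch dimension with repeat_interleave, and all questions go through the text encoder and decoder in one padded batch.)

import torch
from transformers import AutoProcessor, BlipForQuestionAnswering

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").to(device)

@torch.no_grad()
def ask_batched(image, questions):
    # Encode the image exactly once.
    pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)
    image_embeds = model.vision_model(pixel_values=pixel_values)[0]
    # Tile the single embedding along the batch dimension, one copy per question.
    image_embeds = image_embeds.repeat_interleave(len(questions), dim=0)
    image_mask = torch.ones(image_embeds.size()[:-1], dtype=torch.long, device=device)
    # Tokenize all questions as one padded batch and run the text encoder once.
    inputs = processor.tokenizer(questions, padding=True, return_tensors="pt").to(device)
    question_embeds = model.text_encoder(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        encoder_hidden_states=image_embeds,
        encoder_attention_mask=image_mask,
        return_dict=False,
    )[0]
    # Decode answers for the whole batch in a single generate call.
    bos_ids = torch.full((len(questions), 1), model.decoder_start_token_id, device=device)
    outputs = model.text_decoder.generate(
        input_ids=bos_ids,
        eos_token_id=model.config.text_config.sep_token_id,
        pad_token_id=model.config.text_config.pad_token_id,
        encoder_hidden_states=question_embeds,
        encoder_attention_mask=inputs.attention_mask,
    )
    return processor.batch_decode(outputs, skip_special_tokens=True)

Repeating the embedding costs some memory but no extra vision-encoder compute, which is the point of the suggestion.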

@LiJunnan1992 Nice advice, thanks for the reply.

SKBL5694 · Jul 12 '22 07:07

If you only want to encode the image once, try this:

import torch
from transformers import AutoProcessor, BlipForQuestionAnswering

device = "cuda" if torch.cuda.is_available() else "cpu"
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").to(device)
processor = AutoProcessor.from_pretrained("Salesforce/blip-vqa-base")

def ask(image, questions):
    # preprocess image
    image = processor.image_processor(image, return_tensors="pt").to(device)
    
    # preprocess texts
    questions = [processor.tokenizer(text=q, return_tensors="pt").to(device) for q in questions]
    
    with torch.no_grad():
        # compute image embedding
        vision_outputs = model.vision_model(pixel_values=image["pixel_values"])
        image_embeds = vision_outputs[0]
        image_attention_mask = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(image_embeds.device)
        
        answers = []
        for question in questions:
            # compute text encodings
            question_outputs = model.text_encoder(
                input_ids=question["input_ids"],
                attention_mask=question["attention_mask"],
                encoder_hidden_states=image_embeds,
                encoder_attention_mask=image_attention_mask,
                return_dict=False,
            )
            question_embeds = question_outputs[0]
            question_attention_mask = torch.ones(question_embeds.size()[:-1], dtype=torch.long).to(question_embeds.device)
            bos_ids = torch.full(
                (question_embeds.size(0), 1), fill_value=model.decoder_start_token_id, device=question_embeds.device
            )

            outputs = model.text_decoder.generate(
                input_ids=bos_ids,
                eos_token_id=model.config.text_config.sep_token_id,
                pad_token_id=model.config.text_config.pad_token_id,
                encoder_hidden_states=question_embeds,
                encoder_attention_mask=question_attention_mask,
            )
            
            answer = processor.decode(outputs[0], skip_special_tokens=True)
            answers.append(answer)
        
        return answers

questions = [
    "describe image",
    "is there a motorcycle?",
    "is there a person?",
    "is there a group of people?",
    "is someone touching motorcycle?",
    "is road empty?",
]

ask(image, questions)  # `image` is a PIL.Image loaded beforehand

Output:

['motorcycle', 'yes', 'no', 'no', 'no', 'no']

imneonizer · Mar 10 '23 07:03