BLIP
Can I ask more than 1 question simultaneously through the blip_vqa model?
I know how to ask the same question about multiple images at the same time and get a different result for each image; how do I do the reverse? That is, can I ask multiple questions about the same image and get multiple different answers simultaneously (running the model only once)? If so, how can I do it? I tried feeding multiple questions as a list into the blip_vqa model, but it raised an error that looks like a tensor dimension mismatch. Thank you for your excellent work, and I look forward to your reply.
Hey, you can ask them one by one, something like this:
import torch
import pandas as pd

# load_demo_image, model_vqa and device come from the BLIP VQA demo notebook
def get_answer(img_url, questions):
    image = load_demo_image(img_url, image_size=400, device=device)
    with torch.no_grad():
        # beam search
        answers = {}
        for question in questions:
            answer = model_vqa(image, question, train=False, inference='generate')
            answers[question] = answer
            # print('{} \t\t answer: {}'.format(question, answer))
    df = pd.DataFrame(answers).T.reset_index()
    df.columns = ['Questions', 'Answers']
    # print(df)
    return df

questions = [question1, question2, question3, ...]
answers = get_answer(img_url, questions)
@laxmimerit Thanks for the reply. I am already using something like your code, but with this method the model has to encode the same image several times (once per question). What I actually want to know is whether there is a way to encode the image only once.
It should be possible, if the GIF is accurate.
You can encode the image once and then repeat the resulting image embedding along the batch dimension (once per question).
@LiJunnan1992 Nice advice, thanks for the reply.
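For concreteness, here is a minimal sketch of that suggestion using the Hugging Face transformers BLIP API (same Salesforce/blip-vqa-base checkpoint as below). The function name ask_all and the exact padding/mask handling are my own assumptions, not code from the BLIP repo: the image goes through the vision encoder once, its embedding is repeated along the batch dimension to match the questions, and all answers are generated in a single call.

import torch
from transformers import AutoProcessor, BlipForQuestionAnswering

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").to(device)

def ask_all(image, questions):
    # preprocess the single image and the padded batch of questions
    pixel_values = processor.image_processor(image, return_tensors="pt").to(device)["pixel_values"]
    text_inputs = processor.tokenizer(questions, padding=True, return_tensors="pt").to(device)
    with torch.no_grad():
        # encode the image once ...
        image_embeds = model.vision_model(pixel_values=pixel_values)[0]
        # ... then repeat the embedding along the batch dimension, once per question
        image_embeds = image_embeds.repeat(len(questions), 1, 1)
        image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long, device=device)
        # fuse all questions with the (repeated) image features in one pass
        question_embeds = model.text_encoder(
            input_ids=text_inputs.input_ids,
            attention_mask=text_inputs.attention_mask,
            encoder_hidden_states=image_embeds,
            encoder_attention_mask=image_atts,
            return_dict=False,
        )[0]
        # generate one answer per question in a single call
        bos_ids = torch.full(
            (question_embeds.size(0), 1), fill_value=model.decoder_start_token_id, device=device
        )
        outputs = model.text_decoder.generate(
            input_ids=bos_ids,
            eos_token_id=model.config.text_config.sep_token_id,
            pad_token_id=model.config.text_config.pad_token_id,
            encoder_hidden_states=question_embeds,
            encoder_attention_mask=text_inputs.attention_mask,  # mask padded question tokens
        )
    return processor.batch_decode(outputs, skip_special_tokens=True)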
If you only want to encode the image once, try this:
device = "cuda" if torch.cuda.is_available() else "cpu"
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").to(device)
processor = AutoProcessor.from_pretrained("Salesforce/blip-vqa-base")
def ask(image, questions):
# preprocess image
image = processor.image_processor(image, return_tensors="pt").to(device)
# preprocess texts
questions = [processor.tokenizer(text=q, return_tensors="pt").to(device) for q in questions]
with torch.no_grad():
# compute image embedding
vision_outputs = model.vision_model(pixel_values=image["pixel_values"])
image_embeds = vision_outputs[0]
image_attention_mask = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(image_embeds.device)
answers = []
for question in questions:
# compute text encodings
question_outputs = model.text_encoder(
input_ids=question["input_ids"],
attention_mask=None,
encoder_hidden_states=image_embeds,
encoder_attention_mask=image_attention_mask,
return_dict=False,
)
question_embeds = question_outputs[0]
question_attention_mask = torch.ones(question_embeds.size()[:-1], dtype=torch.long).to(question_embeds.device)
bos_ids = torch.full(
(question_embeds.size(0), 1), fill_value=model.decoder_start_token_id, device=question_embeds.device
)
outputs = model.text_decoder.generate(
input_ids=bos_ids,
eos_token_id=model.config.text_config.sep_token_id,
pad_token_id=model.config.text_config.pad_token_id,
encoder_hidden_states=question_embeds,
encoder_attention_mask=question_attention_mask,
)
answer = processor.decode(outputs[0], skip_special_tokens=True)
answers.append(answer)
return answers
# image is a PIL image of the scene being queried (e.g. loaded with PIL.Image.open)
questions = [
    "describe image",
    "is there a motorcycle?",
    "is there a person?",
    "is there a group of people?",
    "is someone touching motorcycle?",
    "is road empty?",
]
ask(image, questions)
Output:
['motorcycle', 'yes', 'no', 'no', 'no', 'no']
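Note that in this version only the vision encoder runs once; the text encoder and decoder still run once per question inside the loop. If you want a single forward pass for everything, you can combine it with the batching idea above: tokenize all the questions together with padding and repeat image_embeds along the batch dimension so the shapes line up.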