BLIP
Can I ask more than 1 question simultaneously through the blip_vqa model?
I know how to ask the same question about multiple images at the same time and get a different result for each image; how do I do the reverse? That is, can I ask multiple questions about the same image and get multiple different answers simultaneously (running the model only once)? If so, how can I do it? I tried feeding multiple questions as a list into the blip_vqa model, but it raised an error that looks like a tensor dimension mismatch. Thank you for your excellent work, and I look forward to your reply.
Hey, you can ask them one by one, something like this:
import torch
import pandas as pd

# load_demo_image, model_vqa and device come from the BLIP VQA demo notebook
def get_answer(img_url, questions):
    image = load_demo_image(img_url, image_size=400, device=device)
    with torch.no_grad():
        # beam search
        answers = {}
        for question in questions:
            answer = model_vqa(image, question, train=False, inference='generate')
            answers[question] = answer
            # print('{} \t\t answer: {}'.format(question, answer))
    df = pd.DataFrame(answers).T.reset_index()
    df.columns = ['Questions', 'Answers']
    # print(df)
    return df

questions = [question1, question2, question3, ...]
answers = get_answer(img_url, questions)
@laxmimerit Thanks for the reply. I am already using something like your code, but with this method the model has to encode the same image several times (once per question). What I actually want to know is whether there is a way to encode the image only once.
It should be possible, if the GIF is accurate.
You can encode the image once and then repeat the resulting image embedding along the batch dimension (once per question).
@LiJunnan1992 Nice advice, thanks for the reply.
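For concreteness, here is a minimal sketch of that suggestion using the Hugging Face transformers BLIP API (same Salesforce/blip-vqa-base checkpoint as below). The function name ask_all and the exact padding/mask handling are my own assumptions, not code from the BLIP repo: the image goes through the vision encoder once, its embedding is repeated along the batch dimension to match the questions, and all answers are generated in a single call.

import torch
from transformers import AutoProcessor, BlipForQuestionAnswering

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").to(device)

def ask_all(image, questions):
    # preprocess the single image and the padded batch of questions
    pixel_values = processor.image_processor(image, return_tensors="pt").to(device)["pixel_values"]
    text_inputs = processor.tokenizer(questions, padding=True, return_tensors="pt").to(device)
    with torch.no_grad():
        # encode the image once ...
        image_embeds = model.vision_model(pixel_values=pixel_values)[0]
        # ... then repeat the embedding along the batch dimension, once per question
        image_embeds = image_embeds.repeat(len(questions), 1, 1)
        image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long, device=device)
        # fuse all questions with the (repeated) image features in one pass
        question_embeds = model.text_encoder(
            input_ids=text_inputs.input_ids,
            attention_mask=text_inputs.attention_mask,
            encoder_hidden_states=image_embeds,
            encoder_attention_mask=image_atts,
            return_dict=False,
        )[0]
        # generate one answer per question in a single call
        bos_ids = torch.full(
            (question_embeds.size(0), 1), fill_value=model.decoder_start_token_id, device=device
        )
        outputs = model.text_decoder.generate(
            input_ids=bos_ids,
            eos_token_id=model.config.text_config.sep_token_id,
            pad_token_id=model.config.text_config.pad_token_id,
            encoder_hidden_states=question_embeds,
            encoder_attention_mask=text_inputs.attention_mask,  # mask padded question tokens
        )
    return processor.batch_decode(outputs, skip_special_tokens=True)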
If you only want to encode the image once, try this:
device = "cuda" if torch.cuda.is_available() else "cpu"
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").to(device)
processor = AutoProcessor.from_pretrained("Salesforce/blip-vqa-base")
def ask(image, questions):
# preprocess image
image = processor.image_processor(image, return_tensors="pt").to(device)
# preprocess texts
questions = [processor.tokenizer(text=q, return_tensors="pt").to(device) for q in questions]
with torch.no_grad():
# compute image embedding
vision_outputs = model.vision_model(pixel_values=image["pixel_values"])
image_embeds = vision_outputs[0]
image_attention_mask = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(image_embeds.device)
answers = []
for question in questions:
# compute text encodings
question_outputs = model.text_encoder(
input_ids=question["input_ids"],
attention_mask=None,
encoder_hidden_states=image_embeds,
encoder_attention_mask=image_attention_mask,
return_dict=False,
)
question_embeds = question_outputs[0]
question_attention_mask = torch.ones(question_embeds.size()[:-1], dtype=torch.long).to(question_embeds.device)
bos_ids = torch.full(
(question_embeds.size(0), 1), fill_value=model.decoder_start_token_id, device=question_embeds.device
)
outputs = model.text_decoder.generate(
input_ids=bos_ids,
eos_token_id=model.config.text_config.sep_token_id,
pad_token_id=model.config.text_config.pad_token_id,
encoder_hidden_states=question_embeds,
encoder_attention_mask=question_attention_mask,
)
answer = processor.decode(outputs[0], skip_special_tokens=True)
answers.append(answer)
return answers
# image is a PIL image of the scene being queried (e.g. loaded with PIL.Image.open)
questions = [
    "describe image",
    "is there a motorcycle?",
    "is there a person?",
    "is there a group of people?",
    "is someone touching motorcycle?",
    "is road empty?",
]
ask(image, questions)
Output:
['motorcycle', 'yes', 'no', 'no', 'no', 'no']
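Note that in this version only the vision encoder runs once; the text encoder and decoder still run once per question inside the loop. If you want a single forward pass for everything, you can combine it with the batching idea above: tokenize all the questions together with padding and repeat image_embeds along the batch dimension so the shapes line up.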