Questions about reproducing BLIP-2 examples
Hi. I'm trying to use your Colab notebook.
I'm loading the most powerful model (the default in the Colab):
model, vis_processors, _ = load_model_and_preprocess(
name="blip2_t5", model_type="pretrain_flant5xxl", is_eval=True, device=device
)
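For context, the rest of my setup looks roughly like this (the image path is just a placeholder I use for illustration):

import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

# use a GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# preprocess a local image with the "eval" visual processor (path is a placeholder)
raw_image = Image.open("pizza.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)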
I'm trying to reproduce two examples in the question answering way you have in the colab:
ans = model.generate({"image": image, "prompt": f"Question: {question} Answer:"})
and I also try just feeding the prompt as is:
ans = model.generate({"image": image, "prompt": f"{question}"})
I'm trying to reproduce the pizza example: the output I receive is only "pepperoni", not the other ingredients.
And for the upside-down house example, I receive only "yes", with no explanation, while the paper figure shows "it's a house that looks like it's upside down".
How can I receive the behavior described in the paper?
Thanks
Hi @yonatanbitton, thanks for your question. The examples in the paper are obtained using nucleus sampling. Please set use_nucleus_sampling=True to activate it: https://github.com/salesforce/LAVIS/blob/5ddd9b4e5149dbc514e81110e03d28458a754c5d/lavis/models/blip2_models/blip2_t5.py#L149 You may also want to increase min_length and max_length to get longer outputs.
Also, when using beam search, you may try increasing the length_penalty term, which encourages longer sequences.
Also, model sizes matter.
For best quality, you may want to use large models, e.g. BLIP2_flant5xxl. You can also try the demo.
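For example, a call along these lines (a sketch only; the prompt and parameter values here are illustrative, not the exact settings behind the paper figures):

# nucleus sampling with longer outputs
ans = model.generate(
    {"image": image, "prompt": "Question: what ingredients are in this pizza? Answer:"},
    use_nucleus_sampling=True,
    min_length=10,
    max_length=60,
)

# or beam search with a length penalty to encourage longer sequences
ans = model.generate(
    {"image": image, "prompt": "Question: what ingredients are in this pizza? Answer:"},
    num_beams=5,
    length_penalty=1.5,
    max_length=60,
)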
Thanks for the response!
@dxli94 I'm loading this model which I understand is the one you mentioned, isn't it?
load_model_and_preprocess(name="blip2_t5", model_type="pretrain_flant5xxl"..)
@LiJunnan1992 I've set use_nucleus_sampling=True and increased max_length from 30 to 60.
That's what I currently receive for the Pizza example:
What am I doing wrong?
I did try the new demo, but since it sends the request to a server, I can't see how the model inference works. I see that the pizza example works much better in the demo, and I want to understand how to modify the default Colab so it gives the same (or the best possible) response. Following your instructions, using use_nucleus_sampling=True and max_length=60 still doesn't yield the expected response.
I'm trying to use the best model for an upcoming paper submission 🙏 🙂
The web demo uses the same generate() function as the notebook demo, so you should be able to get the same response from both under the same hyperparameters. There might be some very small differences due to the hardware used; could you try some other examples and see whether you get the same results from both demos?
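Since nucleus sampling is stochastic, it can also help to fix the random seed before each call when comparing outputs (a minimal sketch; the seed value is arbitrary):

import torch

torch.manual_seed(42)  # make the sampled output repeatable across runs
ans = model.generate(
    {"image": image, "prompt": f"Question: {question} Answer:"},
    use_nucleus_sampling=True,
    max_length=60,
)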
Thanks. You are correct 🙂 I took the following parameters and managed to reproduce 3 of the examples in the demo:
max_length = 30
length_penalty = 1
repetition_penalty = 1.5
temperature = 1

ans = model.generate(
    {"image": image, "prompt": f"{question}"},
    use_nucleus_sampling=True,
    max_length=max_length,
    length_penalty=length_penalty,
    repetition_penalty=repetition_penalty,
    temperature=temperature,
)
Are there any other recommended parameters (or changes to my current ones) for getting the best explanations? (I'm after a single explanation, not a dialog.)
Thanks again for the great work!
@LiJunnan1992 Additionally, I want to reproduce the results on VQA, image captioning, and cross-modal retrieval. Regarding cross-modal retrieval, I saw a related issue; do I understand correctly that I should wait for a dedicated implementation? Regarding VQA and image captioning: which parameters should I use? Should I use beam search or nucleus sampling, and what should the temperature, length penalty, repetition penalty, etc. be? Regarding prompts, I understand that image captioning uses no prompt, but for VQA I should use the prompt "Question: {question} Answer:".
Thanks
Please run these scripts to evaluate on Image Captioning and Image-Text Retrieval. https://github.com/salesforce/LAVIS/tree/main/run_scripts/blip2/eval
We are working on the VQA evaluation script.
Thanks! Those are different datasets, though, and the parameters to use aren't clear from the YAML config or the evaluation script.
In other words, if someone wants to evaluate your models on image captioning / VQA / cross-modal retrieval for a paper, how should they call the model?
In the code I see these default parameters: link
Using the following parameters I was able to reproduce some of the explanations from the paper. I can use them for all of the tasks, but I want to make sure I call the model the same way you do, so that I report the most faithful results.
max_length = 30
length_penalty = 1
repetition_penalty = 1.5
temperature = 1

That's the call for "instructed chat":

ans = model.generate(
    {"image": image, "prompt": f"{question}"},
    use_nucleus_sampling=True,
    max_length=max_length,
    length_penalty=length_penalty,
    repetition_penalty=repetition_penalty,
    temperature=temperature,
)

That's for image captioning:

model.generate({"image": image})

That's for VQA:

ans = model.generate(
    {"image": image, "prompt": f"Question: {question} Answer:"},
    use_nucleus_sampling=True,
    max_length=max_length,
    length_penalty=length_penalty,
    repetition_penalty=repetition_penalty,
    temperature=temperature,
)
Is it correct? Thanks 🙂
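In case it helps, I've wrapped these three call patterns in a single helper (just a sketch of how I'm calling the model; the helper name and the parameter values are mine, not author-confirmed defaults):

def blip2_answer(model, image, question=None, mode="chat"):
    """Dispatch between the three call patterns above: captioning, VQA, and instructed chat."""
    if mode == "caption" or question is None:
        # image captioning uses no prompt
        return model.generate({"image": image})
    # VQA uses the "Question: ... Answer:" template; chat feeds the question as-is
    prompt = f"Question: {question} Answer:" if mode == "vqa" else f"{question}"
    return model.generate(
        {"image": image, "prompt": prompt},
        use_nucleus_sampling=True,
        max_length=30,
        length_penalty=1,
        repetition_penalty=1.5,
        temperature=1,
    )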
@LiJunnan1992 Could you please provide the zero-shot VQA evaluation code and config for BLIP2-OPT? Currently I can only find the config for T5. Thanks!
Hi, were you able to reproduce the zero-shot VQA results for OPT? I find the accuracy is very low. Thanks!