
Questions to reproduce BLIP-2 examples

Open · yonatanbitton opened this issue 2 years ago • 11 comments

Hi. I'm trying to use your colab.

I'm trying the most powerful model (the default in the colab):

model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xxl", is_eval=True, device=device
)

I'm trying to reproduce two examples using the question-answering format you show in the Colab:

ans = model.generate({"image": image, "prompt": f"Question: {question} Answer:"})

and I also tried feeding the prompt as-is:

ans = model.generate({"image": image, "prompt": f"{question}"})

I'm trying to reproduce this example: [image]

I receive the following output: [image]

  • I receive only "pepperoni", and not the other ingredients.

And for this one: [image], that's the output I receive: [image]

  • I don't receive any explanation beyond "yes". The paper figure shows "it's a house that looks like it's upside down".

How can I reproduce the behavior described in the paper?

Thanks

yonatanbitton avatar Feb 02 '23 21:02 yonatanbitton

Hi @yonatanbitton, thanks for your question. The examples in the paper are obtained using nucleus sampling. Please set it to True to activate sampling: https://github.com/salesforce/LAVIS/blob/5ddd9b4e5149dbc514e81110e03d28458a754c5d/lavis/models/blip2_models/blip2_t5.py#L149 You may also want to set min_length and max_length to larger values to get longer outputs.
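For reference, a minimal sketch of such a call (the parameter names follow the generate() signature linked above; the specific min_length and max_length values are only illustrative):

ans = model.generate(
    {"image": image, "prompt": f"Question: {question} Answer:"},
    use_nucleus_sampling=True,  # enable nucleus (top-p) sampling, as used for the paper examples
    min_length=10,              # illustrative: raise to encourage longer outputs
    max_length=60,              # illustrative: raise to allow longer outputs
)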

LiJunnan1992 avatar Feb 03 '23 00:02 LiJunnan1992

Also, when using beam search, you may try increasing the length_penalty term, which can encourage longer sequences.
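For example (a sketch only; num_beams and the length_penalty value are illustrative, not the settings used for the paper results):

ans = model.generate(
    {"image": image, "prompt": f"Question: {question} Answer:"},
    num_beams=5,         # beam search instead of nucleus sampling
    length_penalty=1.5,  # larger values favor longer sequences
)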

LiJunnan1992 avatar Feb 03 '23 00:02 LiJunnan1992

Also, model sizes matter.

For best quality, you may want to use large models, e.g. BLIP2_flant5xxl. You can also try the demo.

dxli94 avatar Feb 03 '23 09:02 dxli94

Thanks for the response!

@dxli94 I'm loading this model, which I understand is the one you mentioned, isn't it? load_model_and_preprocess(name="blip2_t5", model_type="pretrain_flant5xxl"..)

@LiJunnan1992 I've switched to nucleus sampling and increased max_length from 30 to 60.
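Concretely, the call now looks roughly like this (a sketch; every other parameter is left at its default):

ans = model.generate(
    {"image": image, "prompt": f"Question: {question} Answer:"},
    use_nucleus_sampling=True,
    max_length=60,  # increased from the default of 30
)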

That's what I currently receive for the Pizza example: [image]

What am I doing wrong?

I did try the new demo, but since it sends the request to a server, I can't see how the model inference works. I see that the Pizza example works much better in the demo, and I want to understand how to modify the default Colab to make it behave the same (or to give the best response). Following your instructions, using use_nucleus_sampling=True and max_length=60 still doesn't yield the expected response.

I'm trying to use the best model for an upcoming paper submission 🙏 🙂

yonatanbitton avatar Feb 04 '23 00:02 yonatanbitton

The web demo uses the same generate() function as the notebook demo, which means you should be able to get the same response from both demos under the same hyperparameters. There might be a very small difference due to the hardware used; could you try some other examples and see if you can get the same results from both demos?

LiJunnan1992 avatar Feb 04 '23 00:02 LiJunnan1992

Thanks. You are correct 🙂 I took the following parameters and managed to reproduce 3 of the examples in the demo:

max_length = 30
length_penalty = 1
repetition_penalty = 1.5
temperature = 1

ans = model.generate({"image": image, "prompt": f"{question}"}, use_nucleus_sampling=True,
                     max_length=max_length, length_penalty=length_penalty,
                     repetition_penalty=repetition_penalty, temperature=temperature)

Are there any other recommended parameters for getting the best explanations (not dialog-style explanations), or changes to my current parameters?

Thanks again for the great work!

yonatanbitton avatar Feb 04 '23 01:02 yonatanbitton

@LiJunnan1992 Additionally, I want to reproduce the results on VQA, image captioning, and cross-modal retrieval. Regarding cross-modal retrieval, I saw a related issue; do I understand correctly that I should wait for a specific implementation? Regarding VQA and image captioning: what parameters should I use? Should I use beam search or nucleus sampling? What should the temperature, length penalty, repetition penalty, etc. be? Regarding prompts, I understand that image captioning doesn't use a prompt, but for VQA I understand I should also use the prompt "Question: {question} Answer:".

Thanks

yonatanbitton avatar Feb 04 '23 12:02 yonatanbitton

Please run these scripts to evaluate on Image Captioning and Image-Text Retrieval. https://github.com/salesforce/LAVIS/tree/main/run_scripts/blip2/eval

We are working on the VQA evaluation script.

LiJunnan1992 avatar Feb 04 '23 13:02 LiJunnan1992

Thanks! Actually, those are different datasets, and the parameters to use aren't clear from the YAML config or from the evaluation script.

In other words, if someone wants to evaluate your models on the image captioning / VQA / cross-modal retrieval tasks for a paper report, how should they call the model?

In the code I see these default parameters: link

Using the following parameters, I was able to reproduce some of the explanations from the paper. I could use them for all of the tasks, but I want to make sure I call the model the same way you do, so that I report the most faithful results.

max_length = 30
length_penalty = 1
repetition_penalty = 1.5
temperature = 1

That's the call for "instructed chat":

ans = model.generate({"image": image, "prompt": f"{question}"}, use_nucleus_sampling=True, 
                     max_length=max_length, length_penalty=length_penalty, repetition_penalty=repetition_penalty, temperature=temperature)

That's for image captioning:

model.generate({"image": image})

That's for VQA:

ans = model.generate({"image": image, "prompt": f"Question: {question} Answer:"}, use_nucleus_sampling=True, 
                     max_length=max_length, length_penalty=length_penalty, repetition_penalty=repetition_penalty, temperature=temperature)

Is it correct? Thanks 🙂

yonatanbitton avatar Feb 04 '23 13:02 yonatanbitton

@LiJunnan1992 Could you please provide the zero-shot VQA evaluation code and config for BLIP2-OPT? Currently, I can only find the config for T5. Thanks!

YuanLiuuuuuu avatar Apr 05 '23 10:04 YuanLiuuuuuu

Hi @yonatanbitton, were you able to reproduce the zero-shot VQA results for OPT? I find the accuracy is very low. Thanks!

YuanLiuuuuuu avatar Apr 07 '23 03:04 YuanLiuuuuuu