What is the difference between Instructed Zero-shot Image-to-Text Generation and Visual Question Answering in BLIP2?

Open gyula-coder opened this issue 2 years ago • 1 comments

As I understand it, VQA is similar to the zero-shot image-to-text generation ability described in the BLIP2 paper. Both produce an answer to a prompt (a question or a natural language instruction) conditioned on an image. So I'm curious: what is the difference between Instructed Zero-shot Image-to-Text Generation and Visual Question Answering in BLIP2?
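To make the comparison concrete, here is a minimal sketch of the similarity described above: in both cases the model receives an image plus a text prompt and returns text, and only the wording of the prompt differs. Everything in it is illustrative and not taken from the BLIP2 code: the `Question: ... Answer:` template, the helper names, and the `echo_model` stand-in (which replaces a real model call) are all assumptions.

```python
from typing import Callable

# Assumed VQA-style template; the exact wording is an illustration only.
VQA_TEMPLATE = "Question: {question} Answer:"


def vqa_prompt(question: str) -> str:
    """Wrap a question in the assumed VQA-style template."""
    return VQA_TEMPLATE.format(question=question.strip())


def instruction_prompt(instruction: str) -> str:
    """Pass a free-form natural language instruction through unchanged."""
    return instruction.strip()


def generate(image: object, prompt: str,
             model: Callable[[object, str], str]) -> str:
    """Single entry point: text generation conditioned on image and prompt."""
    return model(image, prompt)


def echo_model(image: object, prompt: str) -> str:
    """Hypothetical stand-in for a real vision-language model call."""
    return f"<text conditioned on {image!r} and prompt {prompt!r}>"


if __name__ == "__main__":
    image = "photo.jpg"  # placeholder; a real model would take pixel data
    # Same call path, different prompt wording.
    print(generate(image, vqa_prompt("What is in the picture?"), echo_model))
    print(generate(image, instruction_prompt("Describe the picture in detail."), echo_model))
```

Under these assumptions, the two tasks share one interface, so any difference would have to come from the prompt format or from how the model was trained and evaluated.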

May 19 '23 06:05 gyula-coder

Can I consider instructed image-to-text generation to be VQA, with the zero-shot setting being what is new in BLIP2?

May 19 '23 07:05 gyula-coder