[`pipeline`] Add conditional text support for `ImageToTextPipeline`
What does this PR do?
This PR aims to add conditional generation support for image-to-text models. Guiding the model with a text prompt can sometimes yield better results.
This PR also adds Pix2Struct to the models supported by `ImageToTextPipeline`. Since most Pix2Struct models use conditional generation (for VQA), and since we have a single class, `Pix2StructForConditionalGeneration`, that wraps both the VQA models and the image-captioning models, I thought the best solution would be to simply add conditional text support to `ImageToTextPipeline`.
The reason a `Pix2StructForVQA` class is not implemented is that this model renders the question directly on the image instead of feeding the text input to the model. Hence a single `Pix2StructForConditionalGeneration` class is enough, and the changes are done on the processor side, which takes care of rendering the text on the image.
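For illustration, here is a minimal sketch of the intended usage (the `prompt` argument name, the checkpoint and the image URL are assumptions made for the example, not the final API):

```python
from transformers import pipeline

# Hypothetical usage of the feature proposed in this PR. The "prompt" argument
# name, checkpoint and image URL are illustrative assumptions, not the final API.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

image_url = "https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png"

# Plain image captioning: (image) -> (text).
print(captioner(image_url))

# Text-conditioned generation: the prompt guides the caption.
print(captioner(image_url, prompt="A photo of"))
```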
cc @NielsRogge @Narsil
I have doubts about this:
My biggest issue is with the expected I/O.
Pipelines are defined by their I/O. In this particular case it's (image) -> (text), and here we're modifying it to (image, text) -> (text).
This is OK if and only if the extra text is always purely accessory, which doesn't seem to be the case.
- Can blip work without a prompt? If not, then it does not respect the pipeline I/O and cannot be used.
- All the swap logic seems highly specific and not really a great way to handle this (inspecting signatures is bad in general).
  - For instance, for models that DO NOT handle extra text, the pipeline is going to start generating errors.
- `padding` and `truncation` cannot be top-level parameters; we need to use `tokenizer_kwargs` instead. The reason is that padding and truncation could mean things for images too, and to avoid confusion it's best to split them altogether.
In general I think we can reduce the complexity added in the PR a lot by removing a lot of introduced ifs.
For reference, the prompt reminds me of `hypothesis_template` within the zero-shot-classification pipeline. If a good, sane default exists that can alleviate the I/O issue, then this becomes a bit better.
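As an example of that precedent (checkpoint and inputs chosen arbitrarily; `hypothesis_template` and its default `"This example is {}."` are the existing zero-shot-classification behaviour):

```python
from transformers import pipeline

# Existing precedent: zero-shot-classification exposes hypothesis_template with a
# sane default ("This example is {}."), so the extra text never changes the
# pipeline's (text) -> (labels) contract. Checkpoint and inputs chosen arbitrarily.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "The new GPU doubles training throughput.",
    candidate_labels=["hardware", "cooking", "politics"],
    hypothesis_template="This text is about {}.",
)
print(result["labels"][0])
```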
Thanks for your comments! To reply to some of your doubts:
> Can blip work without a prompt? If not, then it does not respect the pipeline I/O and cannot be used.
Definitely yes. What I meant here is that the text is always accessory for most of the models (BLIP, Pix2Struct trained on image captioning); however, some Pix2Struct models trained on VQA do need text inputs, but those text inputs are handled in an unusual way (the question is rendered directly on the image).
> All the swap logic seems highly specific and not really a great way to handle this (inspecting signatures is bad in general).
Agreed on that. I have updated the checking logic for Pix2Struct. I also realized that vision-encoder-decoder models do not support conditional generation. However, I believe there is a fix that I can quickly address in another PR; if that PR gets merged, this pipeline would support text-conditioned image-to-text inference for all models. (EDIT: #22424)
> `padding` and `truncation` cannot be top-level parameters; we need to use `tokenizer_kwargs` instead. The reason is that padding and truncation could mean things for images too, and to avoid confusion it's best to split them altogether.
Agreed, this has been removed
> In general I think we can reduce the complexity added in the PR a lot by removing a lot of introduced ifs.
I had to come up with this as the CI tests handle multiple input types (lists of images, generators, etc.). I'll do my best to refactor this to make things simpler.
@Narsil what's your opinion on my comment above? E.g. `Pix2StructForConditionalGeneration` solves both image captioning and VQA with the same model, using the same approach: in addition to the input image, you can also feed a text prompt to either 1) guide the captioning or 2) ask a question related to the image. In both cases, the model renders the prompt on top of the image to make the prediction.
Should we add Pix2StructForConditionalGeneration to both the image-to-text and VQA pipelines?
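A rough sketch of what this looks like in code (checkpoint and image URL are placeholders chosen for illustration):

```python
import requests
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Sketch of the idea above: one model class covers both tasks, and for VQA-style
# checkpoints the processor renders the question onto the image. The checkpoint
# and image URL are placeholders chosen for illustration.
image = Image.open(requests.get("https://example.com/document.png", stream=True).raw)

processor = Pix2StructProcessor.from_pretrained("google/pix2struct-docvqa-base")
model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-docvqa-base")

# The question is passed as text; the processor takes care of rendering it on the image.
inputs = processor(images=image, text="What is the total amount?", return_tensors="pt")
outputs = model.generate(**inputs)
print(processor.decode(outputs[0], skip_special_tokens=True))
```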
Is there any difference in the code between VQA and captioning?
In general, pipelines are defined by I/O (input/output meaning (image, text) -> (text)). The rest is semantics, naming and goals.
For instance NER and token-classification are the same, and alias each other.
> Is there any difference in the code between VQA and captioning?
For models like BLIP-2 and Pix2Struct, the code is identical. For BLIP, on the other hand, two different model classes are defined, but other than that the code is also identical (correct me if I'm wrong @younesbelkada).
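For instance, a sketch of the two BLIP classes side by side (checkpoints, prompt, question and image URL are illustrative assumptions):

```python
import requests
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipForQuestionAnswering, BlipProcessor

# BLIP defines two model classes, but the call pattern is essentially the same.
# Checkpoints, prompt, question and image URL are illustrative assumptions.
image = Image.open(requests.get("https://example.com/parrots.png", stream=True).raw)

# Captioning: the text prompt is optional and only guides generation.
cap_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
cap_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
cap_inputs = cap_processor(images=image, text="a photography of", return_tensors="pt")
print(cap_processor.decode(cap_model.generate(**cap_inputs)[0], skip_special_tokens=True))

# VQA: the question is required, but the code is otherwise identical.
vqa_processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
vqa_inputs = vqa_processor(images=image, text="how many birds are in the picture?", return_tensors="pt")
print(vqa_processor.decode(vqa_model.generate(**vqa_inputs)[0], skip_special_tokens=True))
```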
I think adding an optional text prompt to the `ImageToTextPipeline` makes sense; however, I wonder if that doesn't make the VQA pipeline obsolete for those models.
> however I wonder if that doesn't make the VQA pipeline obsolete for those models
Why does it? You said above that both were correct? Did I misunderstand something?
Yes we can technically add them to both pipelines, if you are fine with that.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Closing in favor of #23362
@younesbelkada @NielsRogge I wonder whether this conditional text support for the `ImageToTextPipeline` models only enables the inference stage, or the training stage as well?