[`pipeline`] Add conditional text support for `ImageToTextPipeline`
What does this PR do?
This PR aims to add conditional generation support for image-to-text models. Guiding the model with a text prompt can sometimes yield better results.
This PR also adds Pix2Struct to the models supported by `ImageToTextPipeline`. Since most Pix2Struct models use conditional generation (for VQA), and since we have a single class, `Pix2StructForConditionalGeneration`, that wraps both the VQA models and the image-captioning models, I thought the best solution would be to simply add conditional text support to `ImageToTextPipeline`.
The reason a `Pix2StructForVQA` class is not implemented is that this model renders the question directly on the image instead of feeding the text input to the model. Hence a single `Pix2StructForConditionalGeneration` class is enough, and the changes are done on the processor side, which takes care of rendering the text on the image.
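For illustration, here is a minimal sketch of the intended usage (the `prompt` argument name, the checkpoint and the image URL are assumptions made for the example, not the final API):

```python
from transformers import pipeline

# Hypothetical usage of the feature proposed in this PR. The "prompt" argument
# name, checkpoint and image URL are illustrative assumptions, not the final API.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

image_url = "https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png"

# Plain image captioning: (image) -> (text).
print(captioner(image_url))

# Text-conditioned generation: the prompt guides the caption.
print(captioner(image_url, prompt="A photo of"))
```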
cc @NielsRogge @Narsil
I have doubts about this:
My biggest issue is with the expected I/O.
Pipelines are defined by their I/O. In this particular case it's (image) -> (text), and here we're modifying it to (image, text) -> (text).
This is OK if and only if the extra text is always purely accessory, which doesn't seem to be the case.
- Can blip work without a prompt? If not, then it does not respect the pipeline I/O and cannot be used.
- All the swap logic seems highly specific and not really a great way to handle this (inspecting signatures is bad in general).
  - For instance, for models that DO NOT handle extra text, the pipeline is going to start generating errors.
- `padding` and `truncation` cannot be top-level parameters; we need to use `tokenizer_kwargs` instead. The reason is that padding and truncation could mean things for images too, and to avoid confusion it's best to split them altogether.
In general I think we can reduce the complexity added in the PR a lot by removing a lot of introduced ifs.
For reference, the prompt reminds me of `hypothesis_template` within the zero-shot-classification pipeline. If a good, sane default exists that can alleviate the I/O issue, then this becomes a bit better.
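As an example of that precedent (checkpoint and inputs chosen arbitrarily; `hypothesis_template` and its default `"This example is {}."` are the existing zero-shot-classification behaviour):

```python
from transformers import pipeline

# Existing precedent: zero-shot-classification exposes hypothesis_template with a
# sane default ("This example is {}."), so the extra text never changes the
# pipeline's (text) -> (labels) contract. Checkpoint and inputs chosen arbitrarily.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "The new GPU doubles training throughput.",
    candidate_labels=["hardware", "cooking", "politics"],
    hypothesis_template="This text is about {}.",
)
print(result["labels"][0])
```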
Thanks for your comments! To reply to some of your doubts:
> Can blip work without a prompt? If not, then it does not respect the pipeline I/O and cannot be used.
Definitely yes. What I meant here is that the text is always accessory for most of the models (BLIP, Pix2Struct trained on image captioning); however, some Pix2Struct models trained on VQA do need text inputs, but those text inputs are handled in an unusual way (the question is rendered directly on the image).
> All the swap logic seems highly specific and not really a great way to handle this (inspecting signatures is bad in general).
Agreed on that. I have updated the checking logic for Pix2Struct. I also realized that vision-encoder-decoder models do not support conditional generation. However, I believe there is a fix that I can quickly address in another PR; if that PR gets merged, this pipeline would support text-conditioned image-to-text inference for all models. (EDIT: #22424)
> `padding` and `truncation` cannot be top-level parameters; we need to use `tokenizer_kwargs` instead. The reason is that padding and truncation could mean things for images too, and to avoid confusion it's best to split them altogether.
Agreed, this has been removed
> In general I think we can reduce the complexity added in the PR a lot by removing a lot of introduced ifs.
I had to come up with this as the CI tests handle multiple input types (lists of images, generators, etc.). I'll do my best to refactor this to make things simpler.
@Narsil what's your opinion on my comment above? E.g. `Pix2StructForConditionalGeneration` solves both image captioning and VQA with the same model, using the same approach: in addition to the input image, you can also feed a text prompt to either 1) guide the captioning or 2) ask a question related to the image. In both cases, the model renders the prompt on top of the image to make the prediction.
Should we add Pix2StructForConditionalGeneration to both the image-to-text and VQA pipelines?
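A rough sketch of what this looks like in code (checkpoint and image URL are placeholders chosen for illustration):

```python
import requests
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Sketch of the idea above: one model class covers both tasks, and for VQA-style
# checkpoints the processor renders the question onto the image. The checkpoint
# and image URL are placeholders chosen for illustration.
image = Image.open(requests.get("https://example.com/document.png", stream=True).raw)

processor = Pix2StructProcessor.from_pretrained("google/pix2struct-docvqa-base")
model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-docvqa-base")

# The question is passed as text; the processor takes care of rendering it on the image.
inputs = processor(images=image, text="What is the total amount?", return_tensors="pt")
outputs = model.generate(**inputs)
print(processor.decode(outputs[0], skip_special_tokens=True))
```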
Is there any difference in the code between VQA and captioning?
In general, pipelines are defined by I/O (input/output meaning (image, text) -> (text)). The rest is semantics, naming and goals.
For instance NER and token-classification are the same, and alias each other.
> Is there any difference in the code between VQA and captioning?
For models like BLIP-2 and Pix2Struct, the code is identical. For BLIP, on the other hand, two different model classes are defined, but other than that the code is also identical (correct me if I'm wrong @younesbelkada).
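For instance, a sketch of the two BLIP classes side by side (checkpoints, prompt, question and image URL are illustrative assumptions):

```python
import requests
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipForQuestionAnswering, BlipProcessor

# BLIP defines two model classes, but the call pattern is essentially the same.
# Checkpoints, prompt, question and image URL are illustrative assumptions.
image = Image.open(requests.get("https://example.com/parrots.png", stream=True).raw)

# Captioning: the text prompt is optional and only guides generation.
cap_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
cap_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
cap_inputs = cap_processor(images=image, text="a photography of", return_tensors="pt")
print(cap_processor.decode(cap_model.generate(**cap_inputs)[0], skip_special_tokens=True))

# VQA: the question is required, but the code is otherwise identical.
vqa_processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
vqa_inputs = vqa_processor(images=image, text="how many birds are in the picture?", return_tensors="pt")
print(vqa_processor.decode(vqa_model.generate(**vqa_inputs)[0], skip_special_tokens=True))
```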
I think adding an optional text prompt to the `ImageToTextPipeline` makes sense; however, I wonder if that doesn't make the VQA pipeline obsolete for those models.
> however I wonder if that doesn't make the VQA pipeline obsolete for those models
Why does it? You said above that both were correct? Did I misunderstand something?
Yes we can technically add them to both pipelines, if you are fine with that.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Closing in favor of #23362
@younesbelkada @NielsRogge I wonder whether this conditional text support for the `ImageToTextPipeline` models only enables the inference stage, or the training stage as well?