Add Pix2Struct
What does this PR do?
Fixes #20663
Paper: https://arxiv.org/pdf/2210.03347.pdf Code: https://github.com/google-research/pix2struct
Pix2Struct is a series of image-text models that have been fine-tuned on various datasets and tasks. This integration will offer users a variety of models and potential use cases.
Pix2Struct combines a vision encoder with a text decoder, similar to T5. The method relies heavily on its image processing procedure: unlike classic Vision Transformers, the pre-processing handles images of variable resolution and therefore preserves the aspect ratio of the original image, which appears to be crucial for image understanding.
Therefore I decided to change the current paradigm for obtaining `pixel_values`. The pixel values should now be seen as tokens that are directly produced by the `ImageProcessor`. Hence, I renamed `pixel_values` to `pixel_embeds`, as they in fact correspond to the image embeddings. We now obtain the patch embeddings directly from the processor, which is also responsible for computing the pixel embeds attention mask.
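To illustrate the idea, here is a minimal sketch of what the processor conceptually does (illustrative only, not the exact implementation in this PR; the helper name and the `patch_size` / `max_patches` values are placeholders): a variable-resolution image is split into patches, each patch is flattened into a "token", the sequence is padded to a fixed number of patches, and an attention mask marks the real patches.

```python
import torch

def make_pixel_embeds(image, patch_size=16, max_patches=2048):
    # image: (num_channels, height, width), already resized so the aspect ratio
    # is preserved and the patch grid fits inside `max_patches`
    c, h, w = image.shape
    rows, cols = h // patch_size, w // patch_size

    # split into patches and flatten each patch into a single vector
    patches = image[:, : rows * patch_size, : cols * patch_size]
    patches = patches.reshape(c, rows, patch_size, cols, patch_size)
    patches = patches.permute(1, 3, 0, 2, 4).reshape(rows * cols, c * patch_size * patch_size)

    # pad to a fixed sequence length and build the attention mask over real patches
    seq_len = patches.shape[0]
    pixel_embeds = torch.zeros(max_patches, patches.shape[1])
    pixel_embeds[:seq_len] = patches
    attention_mask = torch.zeros(max_patches)
    attention_mask[:seq_len] = 1.0
    return pixel_embeds, attention_mask

image = torch.rand(3, 320, 480)  # any aspect ratio
pixel_embeds, attention_mask = make_pixel_embeds(image)
print(pixel_embeds.shape, int(attention_mask.sum()))  # torch.Size([2048, 768]) 600
```

The important point is that the model receives a padded `(seq_len, hidden_dim)` sequence plus an attention mask, rather than a fixed-size `pixel_values` grid.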
I will update all the weights (18 in total) after I get 1 approval
TODO
- Fine-tuning notebook
The documentation is not available anymore as the PR was closed or merged.
@younesbelkada @ArthurZucker 👋 how is this PR going? Do you need some help to get it over the finish line? Happy to collab if helpful.
Hi @ankrgyl, thanks so much for offering to help on this PR!
I have now fixed a few tests related to batched generation and addressed most of @ArthurZucker's comments. The architecture is fully ready to use for conditional and unconditional image captioning (a minimal usage sketch is at the end of this comment)! I also wanted to work on a fine-tuning notebook similar to this one: https://colab.research.google.com/drive/1lbqiSiA0sDF7JDWPeS0tccrM85LloVha?usp=sharing as it boosts the usage of the model quite a lot! IMO what is left is:
1. Making a notebook for Pix2Struct using the base model (currently pushed here: https://huggingface.co/ybelkada/pix2struct-textcaps-base)
2. Addressing the last comments
3. Pushing the correct conversion script
4. Pushing the remaining weights (I can do that only after one approval)
If you want, you can help me with 1. If you have any doubts about your modifications, you can just run the integration tests:
RUN_SLOW=1 pytest tests/models/pix2struct/test_modeling_pix2struct.py::Pix2StructIntegrationTest
and make sure they pass!
I am aiming to merge this by the beginning of next week at the latest! Let me know if you want to help with those; otherwise I'm happy to continue the PR 💪
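For reference, here is roughly what unconditional and conditional captioning looks like with the base checkpoint linked above. This is only a sketch assuming the current class and processor names in this PR, so treat the exact calls as provisional:

```python
import requests
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Base checkpoint mentioned above; the final weights may live under a different org.
ckpt = "ybelkada/pix2struct-textcaps-base"
processor = Pix2StructProcessor.from_pretrained(ckpt)
model = Pix2StructForConditionalGeneration.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Unconditional captioning: the processor turns the image into flattened patches
# plus an attention mask, and the model generates a caption from them.
inputs = processor(images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(generated_ids[0], skip_special_tokens=True))

# Conditional captioning: a text prefix is passed alongside the image
# (add_special_tokens=False so no EOS is appended to the prefix).
inputs = processor(images=image, text="A picture of", return_tensors="pt", add_special_tokens=False)
generated_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(generated_ids[0], skip_special_tokens=True))
```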
It looks like you've got it under control so I'll bow out, but happy to test!
I think I have addressed most of the comments! I also updated the PR description and would love to have a round of review! cc @amyeroberts @ArthurZucker
Thanks @amyeroberts for the extensive review! Should have addressed most of them and left some open questions
Regarding the new name `patches`, I am not 100% convinced: users need to see this input as a new paradigm equivalent to text tokens (there is also an attention mask for this input) but applied to images, and I am afraid `patches` will confuse users, as the shape of this input would be hard to interpret: `bs x seq_len x hidden_dim` (with `hidden_dim = num_channels x patch_width x patch_height`).
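For concreteness, with illustrative numbers (3 channels, 16x16 patches, a batch of 2 padded to 2048 patches; none of these values are fixed by this comment):

```python
import torch

num_channels, patch_height, patch_width = 3, 16, 16
hidden_dim = num_channels * patch_height * patch_width   # 3 * 16 * 16 = 768

bs, seq_len = 2, 2048                                     # 2 images, each padded to 2048 patches
image_input = torch.zeros(bs, seq_len, hidden_dim)        # the "tokenized" image input
attention_mask = torch.zeros(bs, seq_len)                 # 1 for real patches, 0 for padding
print(image_input.shape)                                  # torch.Size([2, 2048, 768])
```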
As discussed offline, let's stick with `flattened_patches`! I should have fixed your comments by now, and I added support for VQA models in Pix2Struct, as they require a specific input format / way of running inference.
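For VQA checkpoints, the question goes through the processor together with the image rather than being fed to the decoder as a prefix. A rough sketch of that inference flow, assuming a hypothetical checkpoint name and the processor handling the VQA-specific formatting:

```python
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Hypothetical VQA checkpoint name, only to illustrate the flow.
ckpt = "ybelkada/pix2struct-docvqa-base"
processor = Pix2StructProcessor.from_pretrained(ckpt)
model = Pix2StructForConditionalGeneration.from_pretrained(ckpt)

image = Image.open("document.png")          # any document image
question = "What is the invoice total?"

# The question is passed as `text` to the processor, which applies the
# VQA-specific formatting before producing the flattened patches.
inputs = processor(images=image, text=question, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(generated_ids[0], skip_special_tokens=True))
```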
Thanks a million for the extensive review! 🚀 From what I understood from your comment https://github.com/huggingface/transformers/pull/21400#discussion_r1137690655, I removed the `data_format` argument.
Would love to have a last round of review 💪