Add Pix2Struct
What does this PR do?
Fixes #20663
Paper: https://arxiv.org/pdf/2210.03347.pdf Code: https://github.com/google-research/pix2struct
Pix2Struct is a series of image-text models that have been fine-tuned on various datasets and tasks. This integration will offer users a variety of models and potential use cases.
Pix2Struct combines a vision encoder with a text decoder, similar to T5. The method relies heavily on its image processing procedure: unlike classic Vision Transformers, the pre-processing handles images of variable resolution and therefore preserves the aspect ratio of the original image, which appears to be crucial for image understanding.
Therefore I decided to change the current paradigm for obtaining `pixel_values`. The pixel values should now be seen as tokens that are directly produced by the `ImageProcessor`. Hence, I renamed `pixel_values` to `pixel_embeds`, as they in fact correspond to the image embeddings. We now obtain the patch embeddings directly from the processor, which is also responsible for computing the pixel embeds attention mask.
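To illustrate the idea, here is a minimal sketch of what the processor conceptually does (illustrative only, not the exact implementation in this PR; the helper name and the `patch_size` / `max_patches` values are placeholders): a variable-resolution image is split into patches, each patch is flattened into a "token", the sequence is padded to a fixed number of patches, and an attention mask marks the real patches.

```python
import torch

def make_pixel_embeds(image, patch_size=16, max_patches=2048):
    # image: (num_channels, height, width), already resized so the aspect ratio
    # is preserved and the patch grid fits inside `max_patches`
    c, h, w = image.shape
    rows, cols = h // patch_size, w // patch_size

    # split into patches and flatten each patch into a single vector
    patches = image[:, : rows * patch_size, : cols * patch_size]
    patches = patches.reshape(c, rows, patch_size, cols, patch_size)
    patches = patches.permute(1, 3, 0, 2, 4).reshape(rows * cols, c * patch_size * patch_size)

    # pad to a fixed sequence length and build the attention mask over real patches
    seq_len = patches.shape[0]
    pixel_embeds = torch.zeros(max_patches, patches.shape[1])
    pixel_embeds[:seq_len] = patches
    attention_mask = torch.zeros(max_patches)
    attention_mask[:seq_len] = 1.0
    return pixel_embeds, attention_mask

image = torch.rand(3, 320, 480)  # any aspect ratio
pixel_embeds, attention_mask = make_pixel_embeds(image)
print(pixel_embeds.shape, int(attention_mask.sum()))  # torch.Size([2048, 768]) 600
```

The important point is that the model receives a padded `(seq_len, hidden_dim)` sequence plus an attention mask, rather than a fixed-size `pixel_values` grid.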
I will update all the weights (18 in total) after I get 1 approval
TODO
- Fine-tuning notebook
The documentation is not available anymore as the PR was closed or merged.
@younesbelkada @ArthurZucker 👋 how is this PR going? Do you need some help to get it over the finish line? Happy to collab if helpful.
Hi @ankrgyl, thanks so much for offering to help on this PR!
I have now fixed a few tests related to batched generation and addressed most of @ArthurZucker's comments. The architecture is fully ready to use for conditional and unconditional image captioning (a minimal usage sketch is at the end of this comment)! I also wanted to work on a fine-tuning notebook similar to this one: https://colab.research.google.com/drive/1lbqiSiA0sDF7JDWPeS0tccrM85LloVha?usp=sharing as it boosts the usage of the model quite a lot! IMO what is left is:
1. Making a notebook for Pix2Struct using the base model (currently pushed here: https://huggingface.co/ybelkada/pix2struct-textcaps-base)
2. Addressing the last comments
3. Pushing the correct conversion script
4. Pushing the remaining weights (I can do that only after one approval)
If you want, you can help me with 1. If you have any doubts about your modifications, you can just run the integration tests:
RUN_SLOW=1 pytest tests/models/pix2struct/test_modeling_pix2struct.py::Pix2StructIntegrationTest
and make sure they pass!
I am aiming to merge this by the beginning of next week at the latest! Let me know if you want to help with those; otherwise I'm happy to continue the PR 💪
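For reference, here is roughly what unconditional and conditional captioning looks like with the base checkpoint linked above. This is only a sketch assuming the current class and processor names in this PR, so treat the exact calls as provisional:

```python
import requests
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Base checkpoint mentioned above; the final weights may live under a different org.
ckpt = "ybelkada/pix2struct-textcaps-base"
processor = Pix2StructProcessor.from_pretrained(ckpt)
model = Pix2StructForConditionalGeneration.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Unconditional captioning: the processor turns the image into flattened patches
# plus an attention mask, and the model generates a caption from them.
inputs = processor(images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(generated_ids[0], skip_special_tokens=True))

# Conditional captioning: a text prefix is passed alongside the image
# (add_special_tokens=False so no EOS is appended to the prefix).
inputs = processor(images=image, text="A picture of", return_tensors="pt", add_special_tokens=False)
generated_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(generated_ids[0], skip_special_tokens=True))
```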
It looks like you've got it under control so I'll bow out, but happy to test!
I think I have addressed most of the comments! I also updated the PR description and would love to have a round of review! cc @amyeroberts @ArthurZucker
Thanks @amyeroberts for the extensive review! Should have addressed most of them and left some open questions
Regarding the new name `patches`, I am not 100% convinced: users need to see this input as a new paradigm equivalent to text tokens (there is also an attention mask for this input) but applied to images, and I am afraid `patches` will confuse users, as the shape of this input would be hard to interpret: `bs x seq_len x hidden_dim` (with `hidden_dim = num_channels x patch_width x patch_height`).
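For concreteness, with illustrative numbers (3 channels, 16x16 patches, a batch of 2 padded to 2048 patches; none of these values are fixed by this comment):

```python
import torch

num_channels, patch_height, patch_width = 3, 16, 16
hidden_dim = num_channels * patch_height * patch_width   # 3 * 16 * 16 = 768

bs, seq_len = 2, 2048                                     # 2 images, each padded to 2048 patches
image_input = torch.zeros(bs, seq_len, hidden_dim)        # the "tokenized" image input
attention_mask = torch.zeros(bs, seq_len)                 # 1 for real patches, 0 for padding
print(image_input.shape)                                  # torch.Size([2, 2048, 768])
```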
As discussed offline, let's stick with `flattened_patches`! I should have fixed your comments by now, and I added support for VQA models in Pix2Struct, as they require a specific input format / way of running inference.
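For VQA checkpoints, the question goes through the processor together with the image rather than being fed to the decoder as a prefix. A rough sketch of that inference flow, assuming a hypothetical checkpoint name and the processor handling the VQA-specific formatting:

```python
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Hypothetical VQA checkpoint name, only to illustrate the flow.
ckpt = "ybelkada/pix2struct-docvqa-base"
processor = Pix2StructProcessor.from_pretrained(ckpt)
model = Pix2StructForConditionalGeneration.from_pretrained(ckpt)

image = Image.open("document.png")          # any document image
question = "What is the invoice total?"

# The question is passed as `text` to the processor, which applies the
# VQA-specific formatting before producing the flattened patches.
inputs = processor(images=image, text=question, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(generated_ids[0], skip_special_tokens=True))
```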
Thanks a million for the extensive review! 🚀 From what I understood from your comment https://github.com/huggingface/transformers/pull/21400#discussion_r1137690655, I removed the `data_format` argument.
Would love to have a last round of review 💪