
Add TVLT

Open zinengtang opened this issue 2 years ago • 15 comments

What does this PR do?

Add TVLT to transformers. This PR implements an original version of TVLT (Textless Vision-Language Transformer) from the paper at https://arxiv.org/abs/2209.14156. I have provided the model weights here: https://huggingface.co/TVLT/tvlt-base
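
For reference, a minimal loading sketch, assuming the classes end up exposed as TvltProcessor / TvltModel (the naming adopted later in this thread) and the linked checkpoint keeps this layout; not the final API:

```python
from transformers import TvltModel, TvltProcessor

# Hypothetical usage: the checkpoint ID is the one linked above; class names
# assume the renaming requested below in this thread.
processor = TvltProcessor.from_pretrained("TVLT/tvlt-base")
model = TvltModel.from_pretrained("TVLT/tvlt-base")
```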

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline, Pull Request section? Yes.
  • [x] Was this discussed/approved via a GitHub issue or the forum? It was discussed on Slack.
  • [x] Did you make sure to update the documentation with your changes? I have added the documentation. Please check if it makes sense.
  • [x] Did you write any new necessary tests? Yes. Please check if they make sense.

Who can review?

Anyone.

zinengtang avatar Dec 11 '22 09:12 zinengtang

Thanks a lot for contributing this model! One first remark is that we now try to avoid all-capitals in model class names to keep all model names consistent across the library, so could you rename your classes to TvltModel, TvltConfig, etc. (like BertModel, BertConfig)?
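
For illustration, a minimal sketch of the requested convention (the class bodies below are placeholders, not the actual implementation):

```python
from transformers import PretrainedConfig, PreTrainedModel

# Placeholder sketch of the naming convention only: a CamelCase "Tvlt" prefix,
# mirroring BertConfig / BertModel.
class TvltConfig(PretrainedConfig):
    model_type = "tvlt"

class TvltModel(PreTrainedModel):
    config_class = TvltConfig
```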

Leaving you in the hands of @sanchit-gandhi and @ArthurZucker to finish cleaning the PR and all the tests.

sgugger avatar Dec 12 '22 14:12 sgugger

@NielsRogge @ArthurZucker I have a question. I converted the pixel feature extractor to an image processor class. Should I write a tester for the image processor? I couldn't find any example of one in other models. Since feature extractors will be deprecated in v5, does that mean I don't need to create one with the image processor as its parent class?

zinengtang avatar Dec 17 '22 14:12 zinengtang

Another question: ProcessorMixin is a multimodal processor, but its initialization requires a tokenizer, while TVLT only takes vision and audio inputs. Is there any way, or an existing processor, that can handle this? Would it be good to propose supporting a combination of two feature extractors in the initialization of ProcessorMixin?

zinengtang avatar Dec 17 '22 15:12 zinengtang

Hey!

  1. With regards to testing, you should indeed add a tester. While the FeatureExtractors were replaced with ImageProcessors, I think we are still using the test_feature_extraction_... naming.
  2. I think it indeed makes sense to support multiple FeatureExtractors, and it might also be good to support multiple tokenizers. We could also create a MultiModalProcessorMixin class, but it might be a bit too big for this PR. I think the best solution for now is to just make your class inherit from PushToHubMixin and copy the relevant functions accordingly!

ArthurZucker avatar Dec 19 '22 12:12 ArthurZucker

With regards to testing, you should indeed add a tester. While the FeatureExtractors were replaced with ImageProcessors, I think we are still using the test_feature_extraction_... naming.

We can call them test_image_processing_xxx.py now, see #20716 as an example

I think it indeed makes sense to support multiple FeatureExtractors, and it might also be good to support multiple tokenizers. We could also create a MultiModalProcessorMixin class, but it might be a bit too big for this PR. I think the best solution for now is to just make your class inherit from PushToHubMixin and copy the relevant functions accordingly!

There's no need for a MultiModalProcessorMixin class I think, as the Processor class is exactly meant for multi-modal models. I think we just need to update processing_utils.py to handle the audio modality as well, cc @amyeroberts

NielsRogge avatar Dec 19 '22 13:12 NielsRogge

@NielsRogge @zinengtang Yes, a single TvltProcessor class is the way to go to handle processing of multiple modalities. These processor classes already handle audio (e.g. Wav2Vec2), so I don't think anything needs to be done to ProcessorMixin. It should work provided `attributes` is modified, as is done e.g. for CLIPProcessor.
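
For illustration, a rough sketch of what that could look like; the attribute and class names here are assumptions, not the merged implementation:

```python
from transformers.processing_utils import ProcessorMixin

# Hypothetical processor wrapping an image processor and an audio feature
# extractor instead of a tokenizer, by overriding the `attributes` list.
class TvltProcessor(ProcessorMixin):
    attributes = ["image_processor", "feature_extractor"]
    image_processor_class = "TvltImageProcessor"
    feature_extractor_class = "TvltFeatureExtractor"

    def __init__(self, image_processor, feature_extractor):
        super().__init__(image_processor, feature_extractor)

    def __call__(self, images=None, audio=None, **kwargs):
        # each sub-processor handles its own modality; their outputs are merged
        encoding = {}
        if images is not None:
            encoding.update(self.image_processor(images, **kwargs))
        if audio is not None:
            encoding.update(self.feature_extractor(audio, **kwargs))
        return encoding
```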

amyeroberts avatar Dec 20 '22 11:12 amyeroberts

A summary of what I have committed following the comments:

  1. I addressed the comments from all reviewers. (Let me know if I missed anything.)
  2. I checked all files, ran the Makefile checks, and passed the automatic checks.
  3. Currently, the model input names are pixel_values and audio_values, but audio models all seem to use input_features. Should we switch to input_features? That would make the combination of input names a little more confusing, but I am fine with either option (see the sketch below this list).
  4. I added TvltForAudioVisualClassification. Audiovisual classification could be useful for many tasks, such as video emotion analysis or video-audio retrieval, so I am thinking about adding it as a new task. Audiovisual modeling is a growing community, and this could be useful for new audiovisual or vision+audio models in the future.
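
Here is the kind of call signature in question; the shapes and names are illustrative assumptions only:

```python
import torch

# Option discussed in point 3: keep modality-specific names.
pixel_values = torch.randn(1, 8, 3, 224, 224)  # (batch, frames, channels, height, width)
audio_values = torch.randn(1, 1, 2048, 128)    # (batch, channels, time, frequency bins)
# outputs = model(pixel_values=pixel_values, audio_values=audio_values)

# Versus reusing the audio-model convention:
# outputs = model(pixel_values=pixel_values, input_features=audio_values)
```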

Let me know if you have any suggestions. @ydshieh @ArthurZucker @NielsRogge

zinengtang avatar Dec 26 '22 15:12 zinengtang

This is astounding work, @zinengtang!

You seem to be very familiar with the Transformers code base already :D amazing job.

I'll let @amyeroberts review the image processor, and ask our audio experts @sanchit-gandhi @ArthurZucker about the naming of the audio modality (whether we can go for audio_values or whether we should use something else).

After that, it should be good to merge!

Thanks for the review! :) You mentioned ViTMAE is more similar, but it does not implement attention mask arguments. So I reverted back to ViLT. Let me know if you have any other suggestions.

zinengtang avatar Jan 15 '23 16:01 zinengtang

@zinengtang Thanks for this PR and adding this model! I've just made a few comments regarding the image processor. Overall looks good, mainly just a few formatting points.

Thanks so much for your review. The comments look good to me! I resolved all of them, and you can check if you want. :)

zinengtang avatar Jan 16 '23 22:01 zinengtang

Hi @zinengtang, could you push a commit to trigger the CI? Seems like not all tests are run and many are failing.

After that, I'll assign one team member for a final review.

Thanks!

NielsRogge avatar Jan 18 '23 12:01 NielsRogge

Hi @zinengtang, thanks for the amazing job! Btw I guess that some files were accidentally added in the vilt model directory (https://github.com/huggingface/transformers/pull/20725/files#diff-5342d82acaa480e404377ccc91a49b5203a3119b02c0dac89bcf147cb32e950e). Could you please check?

j-min avatar Jan 18 '23 20:01 j-min

Please make sure you address all comments (if there is a reason they don't work, please reply!), otherwise we won't be able to merge this.

Regarding this issue, I replied earlier that ViTMAE does not implement attention_mask, and similar models like VideoMAE/ViT don't either. Therefore, I used ViLT.

zinengtang avatar Jan 25 '23 18:01 zinengtang

Regarding this issue, I replied earlier that ViTMAE does not implement attention_mask, and similar models like VideoMAE/ViT don't either. Therefore, I used ViLT.

@amyeroberts here is my older comments. Let me know if you have any questions.

zinengtang avatar Jan 26 '23 18:01 zinengtang

@zinengtang Great, thank you for clarifying :)

amyeroberts avatar Jan 27 '23 14:01 amyeroberts

@sanchit-gandhi I have a question. How is it possible to move all the masking code to TvltForPreTraining? The masking is done in the encoding phase and therefore can only be done in TvltModel.

zinengtang avatar Jan 31 '23 06:01 zinengtang

Another question: @amyeroberts suggested putting the random part in the feature extractor, but it seems that Wav2Vec2, as @sanchit-gandhi pointed out, applies the randomness in the model's forward pass.

zinengtang avatar Jan 31 '23 06:01 zinengtang

@zinengtang My suggestion was to make sure any randomness in the image processor could be controlled by the user -- a seed could be passed in and each instance would have its own state -- but the randomness was already there. I agree with @sanchit-gandhi that moving it out into the modeling stage would be better and should make things cleaner :)
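
To make that concrete, a minimal sketch of per-instance, user-controllable randomness; the seed argument, class, and method names are assumptions for illustration only:

```python
import numpy as np

class RandomMaskingMixin:
    def __init__(self, seed=None):
        # each instance keeps its own generator, so results are reproducible
        # without touching global NumPy state
        self.rng = np.random.default_rng(seed)

    def random_mask(self, num_patches, mask_ratio=0.75):
        # boolean mask with `mask_ratio` of the patches masked, drawn from this
        # instance's generator
        num_masked = int(num_patches * mask_ratio)
        mask = np.zeros(num_patches, dtype=bool)
        mask[self.rng.choice(num_patches, size=num_masked, replace=False)] = True
        return mask
```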

amyeroberts avatar Jan 31 '23 17:01 amyeroberts

Full motivations for moving the stochastic MAE masking into the modelling code can be found here: https://github.com/huggingface/transformers/pull/20725#discussion_r1090692949

@zinengtang I see your point! So the way we can arrange it is:

  • We compute all the masks/permutations inside TvltForPreTraining
  • We pass these masks/permutations to TvltModel

In general, we can try to keep as much of the MAE training code as possible in TvltForPreTraining. Where we require it in TvltModel, we can add it accordingly!
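
A rough sketch of that arrangement; the argument names and shapes are assumptions, not the final API:

```python
import torch
from torch import nn

class TvltForPreTraining(nn.Module):
    def __init__(self, tvlt_model, num_pixel_patches, num_audio_patches):
        super().__init__()
        self.tvlt = tvlt_model
        self.num_pixel_patches = num_pixel_patches
        self.num_audio_patches = num_audio_patches

    def forward(self, pixel_values, audio_values):
        batch_size = pixel_values.shape[0]
        # 1. draw the random masking noise here, in the pretraining class
        pixel_mask_noise = torch.rand(batch_size, self.num_pixel_patches)
        audio_mask_noise = torch.rand(batch_size, self.num_audio_patches)
        # 2. pass it down, so TvltModel stays deterministic given its inputs
        return self.tvlt(
            pixel_values,
            audio_values,
            pixel_mask_noise=pixel_mask_noise,
            audio_mask_noise=audio_mask_noise,
        )
```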

sanchit-gandhi avatar Feb 01 '23 10:02 sanchit-gandhi

Thanks @NielsRogge, I have addressed the comments. For audio-based VQA, it somewhat depends on how users want to use it, but usually the audio input should be the query/question. @amyeroberts @sanchit-gandhi thanks for the explanation! I have moved the mask generation inside the modeling file. Feel free to check whether it matches your expectations.

zinengtang avatar Feb 02 '23 02:02 zinengtang

@amyeroberts Hey, thanks for your suggestions! But we cannot move the masks to TvltForPreTraining, since the masking has to be done in TvltModel. If we move the masks from the embeddings to TvltModel, then we will have to break down TvltEmbeddings and directly use TvltPixelEmbeddings and TvltAudioEmbeddings. Let me know if this makes sense.

zinengtang avatar Feb 08 '23 19:02 zinengtang

If we move the masks from the embeddings to TvltModel, then we will have to break down TvltEmbeddings and directly use TvltPixelEmbeddings and TvltAudioEmbeddings.

@zinengtang Yes, I think it makes sense to directly use TvltPixelEmbeddings and TvltAudioEmbeddings instead of having the TvltEmbeddings class.
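
A minimal sketch of that restructuring; all names and arguments here are assumptions for illustration:

```python
import torch
from torch import nn

class TvltModel(nn.Module):
    def __init__(self, pixel_embeddings, audio_embeddings, encoder):
        super().__init__()
        self.pixel_embeddings = pixel_embeddings  # e.g. a TvltPixelEmbeddings module
        self.audio_embeddings = audio_embeddings  # e.g. a TvltAudioEmbeddings module
        self.encoder = encoder

    def forward(self, pixel_values, audio_values,
                pixel_mask_noise=None, audio_mask_noise=None):
        # embed (and optionally mask) each modality separately ...
        pixel_embeds = self.pixel_embeddings(pixel_values, noise=pixel_mask_noise)
        audio_embeds = self.audio_embeddings(audio_values, noise=audio_mask_noise)
        # ... then feed the concatenated sequence to the shared encoder
        hidden_states = torch.cat([pixel_embeds, audio_embeds], dim=1)
        return self.encoder(hidden_states)
```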

amyeroberts avatar Feb 08 '23 19:02 amyeroberts

OK, I found that all my past review replies were 'pending' and maybe they were never sent out lol, which is my bad. Anyway, I addressed the comments; let me know if they make sense @amyeroberts.

zinengtang avatar Feb 09 '23 01:02 zinengtang

@NielsRogge what do you think about the current state? Is there anything else left to address? Thanks!

zinengtang avatar Feb 09 '23 19:02 zinengtang

@NielsRogge I have now addressed the remaining comments. It makes sense to me that TvltForQuestionAnswering is not needed, since it would be the same as TvltForAudioVisualClassification.

zinengtang avatar Feb 10 '23 17:02 zinengtang

@zinengtang Thanks for adding these final changes! I've left a comment on test_feature_extraction_tvlt.py on how to resolve one of the current issues in the CircleCI run. The failing wav2vec2 tests have been resolved on main - rebasing should resolve them on this branch.

Once all the tests are green I think we're good to merge!

amyeroberts avatar Feb 14 '23 17:02 amyeroberts

@amyeroberts Sounds great. Btw there seems to be a failure from another model: FAILED tests/models/hubert/test_modeling_tf_hubert.py::TFHubertRobustModelTest::test_dataset_conversion. Do you think it comes from this branch or from main?

zinengtang avatar Feb 14 '23 22:02 zinengtang

@zinengtang It's coming from main, not this branch :) I've just pushed a change on main - #21643 - which should resolve this for now. Could you rebase to add it to this branch?

amyeroberts avatar Feb 15 '23 14:02 amyeroberts

@amyeroberts Now it passed the tests! Thanks so much for the help/suggestions all the way. :)

zinengtang avatar Feb 15 '23 17:02 zinengtang

@zinengtang Thanks for all your work on adding this model, it's great to have it added to the repo! Make sure to spread the word about it being available :)

amyeroberts avatar Feb 15 '23 18:02 amyeroberts