transformers

Add VATT model

Open johko opened this issue 2 years ago • 7 comments

Model description

Hey, as discussed with @NielsRogge a few weeks back, I'd like to work on adding the "VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text" model from Google.

It is basically three transformers (Video/Audio/Text) that are trained jointly in an unsupervised manner using contrastive loss functions. For downstream tasks they fine-tune the Transformers separately, but also explore a version that shares the weights across all modalities.

For pre-training they use text-video-audio triplets from HowTo100M and video-audio pairs from AudioSet. The authors describe how to fine-tune VATT for vision and audio classification tasks and provide weights for the fine-tuned versions.

The backbone for vision is ViT, for audio it is WaveFormTransformer, and for text they use BERT/T5.
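To make the joint contrastive training mentioned above a bit more concrete, here is a minimal sketch of a symmetric InfoNCE-style loss between two modalities. It is a generic illustration in NumPy, not code from the VATT repo: the function name `info_nce_loss`, the `temperature` default, and the embedding shapes are assumptions chosen for the example, and the paper's actual objectives may differ in detail.

```python
import numpy as np


def info_nce_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss for a batch of paired embeddings.

    Hypothetical helper for illustration only. Row i of `video_emb` and
    row i of `audio_emb` are assumed to come from the same clip (a positive
    pair); every other row in the batch acts as a negative.
    """
    # L2-normalize so the dot product becomes a cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)

    # Pairwise similarities, shape (batch, batch); positives are on the diagonal.
    logits = v @ a.T / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # Negative log-likelihood of the matching (diagonal) entry.
        return -np.mean(np.diag(log_probs))

    # Average the video->audio and audio->video directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(8, 16))
    # Audio embeddings close to their paired video embeddings.
    audio = video + 0.01 * rng.normal(size=(8, 16))
    print("aligned pairs:   ", info_nce_loss(video, audio))
    print("mismatched pairs:", info_nce_loss(video, audio[::-1]))
```

With correctly paired embeddings the loss should be much smaller than when the pairing is broken, which is the signal that pulls matching modalities together during pre-training.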

Open source status

  • [X] The model implementation is available
  • [X] The model weights are available

Provide useful links for the implementation

Paper: https://arxiv.org/pdf/2104.11178.pdf
GitHub: https://github.com/google-research/google-research/tree/master/vatt

johko avatar Oct 25 '22 08:10 johko

@johko have you started implementing it?

fcakyon avatar Nov 20 '22 18:11 fcakyon

@fcakyon yes, I have started, but progress is still rather slow, as this is my first model contribution and I have to figure out some stuff.

johko avatar Nov 22 '22 16:11 johko

@johko I totally understand it. Interested in your implementation since I will be using VATT in my research next year :)

Are you working on a TF implementation?

fcakyon avatar Nov 22 '22 16:11 fcakyon

Sorry for the late reply (again 🙈). Yes, I'm working on a TF implementation. As the original repo is using it, I'm doing that first and will then look into PyTorch.

johko avatar Nov 27 '22 10:11 johko

@johko, thanks for the response! I may also help with the pytorch part once you finalize the TF implementation 👍

fcakyon avatar Nov 27 '22 11:11 fcakyon

@fcakyon that would be great, as my expertise is more in TF 🙂

johko avatar Nov 27 '22 11:11 johko

Hey @NielsRogge, I'm sorry but I think I have to stop working on this for good. I'd love to finish it, but every time I think I finally have some time to do it, something else comes up 😞

I think I just can't provide a big contribution like this atm and will rather focus on smaller things. But maybe @fcakyon wants to pick it up.

Sorry for blocking this so long.

johko avatar Jan 24 '23 11:01 johko

Any news about the VATT PyTorch implementation?

pretbc avatar Sep 20 '23 09:09 pretbc