
[FEAT] Add MobileViT v1 & v2

Open yassineAlouini opened this issue 2 years ago • 6 comments

🚀 The feature

As described in the RFC "Batteries Included, phase 3", I am working on adding MobileViT v1 and v2, inspired by the following code repos/snippets:

  • https://github.com/chinhsuanwu/mobilevit-pytorch
  • https://github.com/apple/ml-cvnets/blob/main/cvnets/models/classification/mobilevit_v2.py

The original paper can be found here.

Motivation, pitch

This has been decided in the RFC.

Alternatives

No response

Additional context

No response

cc @datumbox

yassineAlouini avatar Aug 12 '22 05:08 yassineAlouini

Looks great @yassineAlouini. It would be great to get this implementation.

Please have a read of https://github.com/pytorch/vision/issues/5319, where we document some best practices for model authoring. Also, to avoid licensing problems, let's do a from-scratch implementation.

datumbox avatar Aug 12 '22 07:08 datumbox

Perfect, I will work on this today but mostly next week and the week after. Will let you know how my progress goes. 👌

yassineAlouini avatar Aug 12 '22 08:08 yassineAlouini

I have started the implementation. It seems like a big chunk, but I'm excited to do it. :ok_hand:

I have found this Hugging Face implementation, which could be useful as another source of inspiration: https://huggingface.co/docs/transformers/main/model_doc/mobilevit.

[EDIT] It looks like this is a wrapper around the https://github.com/apple/ml-cvnets implementation. :ok_hand:

yassineAlouini avatar Aug 18 '22 14:08 yassineAlouini

Hi @yassineAlouini. Just wanted to touch base on the implementation. Any blockers or need help?

datumbox avatar Sep 14 '22 14:09 datumbox

Hello @datumbox, thanks for checking. So far, so good. It is taking a bit longer since I have only had one day to work on it, and it is paused for now, but I might work on it during weekends and evenings.

Do you have a target date for finishing? :thinking:

yassineAlouini avatar Sep 15 '22 08:09 yassineAlouini

hey @yassineAlouini, sounds good. Thanks for the work. There are absolutely no deadlines on our side; just checking that everything goes smoothly and that you don't have a blocker. Let me know if you need anything :)

datumbox avatar Sep 15 '22 09:09 datumbox

Some update @datumbox: I will have some free time for the upcoming few days and should make some progress. Will let you know how it goes. 👌

yassineAlouini avatar Oct 29 '22 08:10 yassineAlouini

By the way, what are the PyTorch and TorchVision policies on the usage of einops? 🤔

yassineAlouini avatar Oct 30 '22 10:10 yassineAlouini

@yassineAlouini So far we don't have a model using this. Is there a specific use-case in MobileViT that can't be done otherwise?

datumbox avatar Oct 31 '22 09:10 datumbox

I don't think it is irreplaceable; I just wanted to check what the best practice is in torchvision. 👌 I will code everything using PyTorch and existing TorchVision code.

yassineAlouini avatar Oct 31 '22 09:10 yassineAlouini
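For context on replacing einops: the MobileViT block rearranges a feature map into non-overlapping patches before the transformer and folds them back afterwards. The reference implementations express this with `einops.rearrange`; a sketch of an equivalent in plain PyTorch `reshape`/`permute` could look like the following (function names and the exact `"b d (h ph) (w pw) -> b (ph pw) (h w) d"` layout are taken from the chinhsuanwu repo linked above; treat this as an illustrative sketch, not the final torchvision code):

```python
import torch

def unfold_patches(x: torch.Tensor, ph: int, pw: int) -> torch.Tensor:
    # Plain-PyTorch equivalent of
    # einops.rearrange(x, "b d (h ph) (w pw) -> b (ph pw) (h w) d", ph=ph, pw=pw)
    b, d, H, W = x.shape
    h, w = H // ph, W // pw
    x = x.reshape(b, d, h, ph, w, pw)   # split spatial dims into a patch grid
    x = x.permute(0, 3, 5, 2, 4, 1)     # -> (b, ph, pw, h, w, d)
    return x.reshape(b, ph * pw, h * w, d)

def fold_patches(x: torch.Tensor, ph: int, pw: int, H: int, W: int) -> torch.Tensor:
    # Inverse: "b (ph pw) (h w) d -> b d (h ph) (w pw)"
    b, _, _, d = x.shape
    h, w = H // ph, W // pw
    x = x.reshape(b, ph, pw, h, w, d)
    x = x.permute(0, 5, 3, 1, 4, 2)     # -> (b, d, h, ph, w, pw)
    return x.reshape(b, d, H, W)
```

The two functions are exact inverses of each other, which makes the replacement easy to unit-test against the einops version.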

One additional question regarding the TransformerEncoder: should I reimplement it or should I re-use the one from vision_transformer.py (i.e. EncoderBlock)? I was planning to copy-paste the code first, adapt it and then maybe later refactor. What do you think @datumbox?

yassineAlouini avatar Oct 31 '22 10:10 yassineAlouini

@yassineAlouini Makes sense. Let's start by copy-pasting and modifying and see what changes are needed. Then we can decide whether sharing components is worth it. :)
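For reference, the block being discussed is a standard pre-norm transformer encoder block. A minimal self-contained sketch of what the copy-pasted version might look like is below; it mirrors the general shape of `EncoderBlock` in torchvision's `vision_transformer.py`, but the constructor arguments and defaults here are illustrative, not the actual torchvision signature:

```python
import torch
from torch import nn

class EncoderBlock(nn.Module):
    """Pre-norm transformer encoder block (sketch; args are illustrative)."""

    def __init__(self, hidden_dim: int, num_heads: int, mlp_dim: int, dropout: float = 0.0):
        super().__init__()
        self.ln_1 = nn.LayerNorm(hidden_dim)
        self.self_attention = nn.MultiheadAttention(
            hidden_dim, num_heads, dropout=dropout, batch_first=True
        )
        self.ln_2 = nn.LayerNorm(hidden_dim)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, mlp_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_dim, hidden_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual self-attention with pre-norm, then residual MLP with pre-norm.
        y = self.ln_1(x)
        y, _ = self.self_attention(y, y, y, need_weights=False)
        x = x + y
        return x + self.mlp(self.ln_2(x))
```

Starting from a copy like this makes it easy to see which MobileViT-specific changes accumulate before deciding whether sharing the component with `vision_transformer.py` is worth it.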

datumbox avatar Oct 31 '22 10:10 datumbox

Some more progress @datumbox: I finally made the V1 work (I think), I am cleaning the code a bit and then will push it for a first round of reviews (to make sure I am on the right track). I will then focus on training the model to get the weights. Will let you know once it is pushed. Thanks in advance for your help!

yassineAlouini avatar Nov 17 '22 21:11 yassineAlouini

Alright, MobileViT (the v1 version) finally runs 🎉 and I have pushed the code. If you have some time @datumbox, I would love to get a few first comments. Thanks. 🙏

The PR is here: https://github.com/pytorch/vision/pull/6965/

I am starting the training step now and next will move to V2.

yassineAlouini avatar Nov 20 '22 16:11 yassineAlouini

Alright, I have tried running torchrun --nproc_per_node=8 train.py --model mobile_vit_xxs to train a model on my Windows laptop, but it does not seem promising. I will try this on a cloud instance or on Colab. @datumbox, is there some available torchvision infra to do this, or should I do it on my own? Thanks. :)

yassineAlouini avatar Nov 20 '22 21:11 yassineAlouini

@yassineAlouini thanks! I've responded on the PR, let's continue the discussion there. :)

datumbox avatar Nov 24 '22 10:11 datumbox

Thanks @datumbox (et al) for the code review, I am checking now. 👌

yassineAlouini avatar Nov 26 '22 09:11 yassineAlouini

@datumbox @pmeier I am trying to make progress on this PR again. I need the ImageNet dataset to train the model and get the weights. I sent an access request more than 10 days ago and still have no response. Do you have another way to get the whole dataset? Thanks for your help!

yassineAlouini avatar Apr 08 '23 14:04 yassineAlouini

@yassineAlouini Given the license of ImageNet, there is no way for us to redistribute it. So I think we might have to wait for them to respond. :(

datumbox avatar Apr 11 '23 08:04 datumbox

Unfortunately, I wouldn't get my hopes up:

[Screenshot from 2023-04-11 10-54-13]

Messaged them multiple times without a response ...

pmeier avatar Apr 11 '23 08:04 pmeier

Thanks for the feedback @pmeier. I thought 10 days was a long time. :smile: Alright, I will try to finish the other points of the PR review and maybe ask someone on the torchvision team to do the training (if someone has the data) and then I can check the performance once the weights have been trained. :+1:

yassineAlouini avatar Apr 11 '23 09:04 yassineAlouini

@yassineAlouini We will try to work something out with @pmeier. He will ping you by email. I also pinged two of the people involved with ImageNet on Twitter to see if they can help. We'll work something out. 🤞

datumbox avatar Apr 11 '23 10:04 datumbox

There is a copy on Kaggle. https://www.kaggle.com/c/imagenet-object-localization-challenge/

gau-nernst avatar Apr 11 '23 11:04 gau-nernst

Thanks for the link @gau-nernst, but that is a smaller dataset, if I am not wrong.

yassineAlouini avatar Apr 11 '23 12:04 yassineAlouini

@yassineAlouini I believe it is the ImageNet-1k split that most people commonly refer to as "the ImageNet dataset" (used in ILSVRC). It should be the correct one.

Otherwise, HuggingFace is also hosting ImageNet-1k here: https://huggingface.co/datasets/imagenet-1k

gau-nernst avatar Apr 11 '23 13:04 gau-nernst

Thanks for the feedback and the link @gau-nernst. Isn't the "real" dataset the 22k one? Anyway, I will give it a try with the smaller one once I have time.

yassineAlouini avatar Sep 20 '23 14:09 yassineAlouini