[100] MobileViT v1, v2, v3

Open dhkim0225 opened this issue 2 years ago • 0 comments

v1 paper v2 paper v3 paper v1, v2 code v3 code

Interestingly, the v3 authors have a different affiliation. Was this even agreed on with the original authors? lol

- v1 ==> ICLR 22, Apple paper
- v2 ==> same authors as v1, Apple paper, under review
- v3 ==> different authors, Micron paper, under review

MobileViT V1

This figure is the key: it presents the MobileViT block. The authors claim it behaves similarly to a CNN. The CNN kernels are 3x3.

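In effect, the block looks at local information first and then merges it globally. It can also be seen as a way to make self-attention, which has to look over a large area, more efficient.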
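To make the "local first, then global" idea concrete, here is a minimal sketch of the unfold/fold step, assuming a single-channel feature map stored as nested lists. The names `unfold` and `fold` are hypothetical helpers for illustration, not the cvnets API. Each sequence collects the same in-patch pixel slot from every patch, so a transformer run over one sequence mixes information across patches.

```python
def unfold(fmap, ph, pw):
    """Group an H x W map into P = ph*pw sequences of N patch entries each.

    seqs[p][n] is the value at in-patch slot p of patch n.
    """
    H, W = len(fmap), len(fmap[0])
    assert H % ph == 0 and W % pw == 0, "map must be divisible by the patch size"
    seqs = [[] for _ in range(ph * pw)]
    for py in range(0, H, ph):          # top row of each patch
        for px in range(0, W, pw):      # left column of each patch
            for dy in range(ph):
                for dx in range(pw):
                    seqs[dy * pw + dx].append(fmap[py + dy][px + dx])
    return seqs


def fold(seqs, H, W, ph, pw):
    """Inverse of unfold: scatter the sequences back onto an H x W map."""
    fmap = [[None] * W for _ in range(H)]
    n = 0                               # running patch index
    for py in range(0, H, ph):
        for px in range(0, W, pw):
            for dy in range(ph):
                for dx in range(pw):
                    fmap[py + dy][px + dx] = seqs[dy * pw + dx][n]
            n += 1
    return fmap


fmap = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 map
seqs = unfold(fmap, 2, 2)
print(seqs[0])                      # top-left pixel of every 2x2 patch: [0, 2, 8, 10]
print(fold(seqs, 4, 4, 2, 2) == fmap)  # True: the round trip is lossless
```

With a 2x2 patch the map divides evenly, which is why no interpolation is needed in this toy case.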

Settings

Unusual settings (does not use the Swin-style recipe or anything like it)
GPU: (NVIDIA ???) * 8
epochs: 300
batch-size: 1024
resolution: multiscale {(160, 160),(192, 192),(256, 256),(288, 288),(320, 320)}
Augmentation
- described as "basic"
- https://github.com/apple/ml-cvnets/blob/main/config/classification/imagenet/mobilevit.yaml
- random_resized_crop
- random_horizontal_flip
optimizer
- AdamW
- weight decay: 0.01
- lr
  - 0.0002 => 0.002, warmup over 3k iterations
  - cosine decay
EMA
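The lr entries above can be sketched as a single function. This is a rough sketch under assumptions: the notes only give the warmup range, the 3k warmup iterations, and "cosine decay", so the linear warmup shape, the final value `lr_end`, and `TOTAL_ITERS` (roughly 300 epochs of ImageNet-1k at batch size 1024) are my own placeholders.

```python
import math

TOTAL_ITERS = 375_000  # assumption: ~1250 iterations/epoch * 300 epochs


def lr_at(it, total_iters=TOTAL_ITERS, warmup_iters=3000,
          lr_start=0.0002, lr_max=0.002, lr_end=0.0002):
    """Linear warmup from lr_start to lr_max, then cosine decay to lr_end.

    lr_end is an assumed value; the notes do not state where the decay ends.
    """
    if it < warmup_iters:
        return lr_start + (lr_max - lr_start) * it / warmup_iters
    progress = (it - warmup_iters) / max(1, total_iters - warmup_iters)
    return lr_end + 0.5 * (lr_max - lr_end) * (1 + math.cos(math.pi * progress))


for it in (0, 1500, 3000, TOTAL_ITERS):
    print(it, lr_at(it))
```

The v2 settings would only change the arguments (`lr_start` and `warmup_iters`), not the shape of the schedule.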

family

Three variants are provided: MobileViT-XXS, MobileViT-XS, and MobileViT-S. See the config at https://github.com/micronDLA/MobileViTv3/blob/main/MobileViTv3-v1/cvnets/models/classification/config/mobilevit.py for reference.

multi scale sampling

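Existing ViT-style models need positional-encoding interpolation when the input size changes (which means fine-tuning per size). MobileViT trains at multiple scales, and, unusually, the algorithm is set up to use a larger batch size on iterations that sample a smaller resolution. $H_n, W_n$ is the maximum resolution that can be sampled, and $H_t, W_t$ is the resolution actually sampled. The paper does not say in detail how this affects accuracy, hmm.. There must be some difference, so I will probably have to run it myself.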
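A small sketch of the batch-size rule, assuming it scales inversely with pixel count, i.e. $b_t = \frac{H_n W_n b}{H_t W_t}$ with $b$ the batch size at the maximum resolution. The value `base_batch=128` is a hypothetical per-GPU number chosen only for illustration, and `batch_size_for` / `sample_iteration` are made-up helper names, not the cvnets sampler.

```python
import random

# Resolutions listed in the v1 settings.
RESOLUTIONS = [(160, 160), (192, 192), (256, 256), (288, 288), (320, 320)]


def batch_size_for(res, base_batch, max_res):
    """b_t = H_n * W_n * b / (H_t * W_t), floored, and at least 1."""
    h_t, w_t = res
    h_n, w_n = max_res
    return max(1, (h_n * w_n * base_batch) // (h_t * w_t))


def sample_iteration(base_batch, rng):
    """Pick a resolution for this iteration and the matching batch size."""
    res = rng.choice(RESOLUTIONS)
    return res, batch_size_for(res, base_batch, max(RESOLUTIONS))


rng = random.Random(0)
for _ in range(3):
    print(sample_iteration(128, rng))
```

Under this rule the pixel count per iteration stays roughly constant, so small-resolution iterations process more images for about the same cost.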

Ablations

Not very sensitive to weight decay.

The skip connection improves performance by about 0.5.

Patch size is critical. 3, 3, 3 gave better results, but it is slow because folding and unfolding then require interpolation. The final choice is 2, 2, 2.

Label smoothing accounts for 0.3.

MobileViT V2

Self-attention is what makes the model slow, so the paper proposes a separable self-attention module.
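For intuition, a dependency-free sketch of separable self-attention as I understand it from the v2 paper: one scalar score per token, a softmax over tokens, a single context vector, then an element-wise broadcast onto the value branch. Biases, heads, and dropout are omitted, and the function and weight names are my own; treat it as an illustration rather than the reference implementation.

```python
import math


def matvec(W, v):
    """Multiply an (out x in) weight matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def separable_self_attention(tokens, w_i, W_k, W_v, W_o):
    """tokens: k vectors of size d. No k x k attention matrix is ever built."""
    # Branch I: one scalar score per token, softmax over the k tokens.
    scores = [sum(w * x for w, x in zip(w_i, t)) for t in tokens]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    context_scores = [e / z for e in exps]

    # Branch K: weighted sum of projected tokens gives one context vector.
    keys = [matvec(W_k, t) for t in tokens]
    d = len(keys[0])
    context = [sum(c * key[j] for c, key in zip(context_scores, keys))
               for j in range(d)]

    # Branch V: ReLU(projection), scaled element-wise by the context vector.
    out = []
    for t in tokens:
        v = [max(0.0, a) for a in matvec(W_v, t)]
        out.append(matvec(W_o, [a * b for a, b in zip(v, context)]))
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 2.0], [3.0, 4.0]]
# Zero score weights give uniform context scores, so context = mean of tokens.
print(separable_self_attention(tokens, [0.0, 0.0], identity, identity, identity))
```

The cost grows linearly with the number of tokens k, which is where the speedup over standard self-attention comes from.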

They measured it on an iPhone 12, and the speed looks fine.

In addition, the stem reportedly uses DW-conv, and the skip-connection and fusion block are dropped. Speed is said to improve a lot relative to the accuracy. For a figure on this, the V3 figure is a good reference. Scaling uses a width multiplier $\alpha \in \{0.5, 2.0\}$.

Settings

GPU: (NVIDIA ???) * 8
epochs: 300
batch-size: 1024
resolution: (256, 256)
Augmentation
- much stronger than v1
- https://github.com/apple/ml-cvnets/blob/main/config/classification/imagenet/mobilevit.yaml
- random_resized_crop
- random_horizontal_flip
- rand_augment: torchvision default
- random_erase: 0.25
- mixup: alpha 0.2
- cutmix: alpha 1.0
- resize: size 288 bicubic interp
- center_crop: size 256
optimizer
- AdamW
- weight decay: 0.01
- lr
  - 0.000002 => 0.002, warmup over 20k iterations
  - cosine decay
EMA
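As an illustration of the "mixup: alpha 0.2" entry above, a minimal batch-level mixup sketch. It assumes one mixing coefficient per batch drawn from Beta(alpha, alpha) and one-hot labels; how cvnets actually combines mixup with cutmix is not covered in these notes, and `mixup` here is a hypothetical helper.

```python
import random


def mixup(images, labels, alpha=0.2, rng=random):
    """Blend each sample with a randomly chosen partner from the same batch.

    images: list of flat float lists; labels: list of one-hot float lists.
    """
    lam = rng.betavariate(alpha, alpha)   # small alpha pushes lam toward 0 or 1
    perm = list(range(len(images)))
    rng.shuffle(perm)

    def blend(a, b):
        return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]

    mixed_x = [blend(images[i], images[j]) for i, j in enumerate(perm)]
    mixed_y = [blend(labels[i], labels[j]) for i, j in enumerate(perm)]
    return mixed_x, mixed_y, lam


images = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
labels = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mixed_x, mixed_y, lam = mixup(images, labels, alpha=0.2, rng=random.Random(0))
print(lam, mixed_y)
```

With alpha 0.2 most batches end up close to unmixed, so the augmentation is milder than the alpha 1.0 used for cutmix.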

MobileViT V3

Improves the block. The authors experiment with a number of variants. It is not really the same thing, but it reminds me a little of hourglass.

Fusion conv: 3x3 => 1x1

The claim is that fusing local and global features does not need to look at other positions. The motivation is that simplifying the operation should actually help accuracy.
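A quick parameter count shows what the 3x3 => 1x1 swap saves. The sketch assumes the fusion conv takes two concatenated C-channel tensors (2C in, C out) and has no bias; `C = 64` is a hypothetical width, not a value from the paper.

```python
def conv_params(c_in, c_out, k, bias=False):
    """Weights in a dense k x k convolution (plus optional bias)."""
    return k * k * c_in * c_out + (c_out if bias else 0)


C = 64  # hypothetical channel width
fusion_3x3 = conv_params(2 * C, C, 3)
fusion_1x1 = conv_params(2 * C, C, 1)
print(fusion_3x3, fusion_1x1, fusion_3x3 // fusion_1x1)  # 73728 8192 9
```

The 9x ratio holds for any C, and the multiply-adds per output pixel shrink by the same factor.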

Local and Global features fusion

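The local representation was not used in the fusion before; the paper claims it is better to concat it as well. The reason given is that local features are more closely related to global features than the input features are. The computation saved by 3x3 => 1x1 is spent here.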
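A toy sketch of fusing local and global features with a 1x1 conv, assuming channel-last nested lists and a bias-free (C_out x 2C) weight. `fuse_1x1` is a hypothetical name, and this covers only the concat-and-project step, not the full v3 block. Since a 1x1 conv is a per-pixel linear map, each output position depends only on the same position in both inputs.

```python
def fuse_1x1(local, glob, weight):
    """Concat local/global channels at each pixel, then apply a 1x1 conv.

    local, glob: H x W x C nested lists; weight: C_out x 2C nested list.
    """
    out = []
    for local_row, glob_row in zip(local, glob):
        out_row = []
        for local_px, glob_px in zip(local_row, glob_row):
            feat = local_px + glob_px          # channel concat at this pixel
            out_row.append([sum(w * x for w, x in zip(w_row, feat))
                            for w_row in weight])
        out.append(out_row)
    return out


local = [[[1.0], [2.0]]]      # H=1, W=2, C=1
glob = [[[10.0], [20.0]]]
weight = [[1.0, 0.5]]         # C_out=1, 2C=2
print(fuse_1x1(local, glob, weight))  # [[[6.0], [12.0]]]
```

Changing the features at one pixel leaves every other output pixel untouched, which is exactly the "no need to look at other positions" argument.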