
[41] PVTv2: Improved Baselines with Pyramid Vision Transformer


A follow-up to PVT (#68). A short, simple 6-page paper.

paper / code

Three things were added:

  1. overlapping patch embedding
  2. convolutional feedforward networks
  3. linear complexity attention layers

Straight to the point.

Improved Pyramid Vision Transformer

Limitations in PVTv1

  1. Overlapping patch embeddings (as in Swin) help, but PVTv1 didn't use them
  2. The positional encoding is fixed in size, which makes arbitrary-size inference hard (Swin also solved this in v2 (#59))
  3. With high-resolution inputs, the attention complexity still ends up blowing up

Overlapping Patch Embedding

A different approach from Swin: overlapping is implemented simply by adding zero-padding. The positional encoding is removed entirely, with the claim that the zero-padding takes over its role.
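A minimal sketch of overlapping patch embedding, assuming PyTorch. The module name is illustrative, and the hyperparameters (7x7 kernel, stride 4, zero-padding 3) are the paper's stage-1 stem setting; later stages use a smaller kernel and stride.

```python
import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    def __init__(self, in_chans=3, embed_dim=64, patch_size=7, stride=4):
        super().__init__()
        # zero-padding makes neighboring windows overlap and (per the paper)
        # leaks enough positional information to drop positional encodings
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=stride,
                              padding=patch_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, D, H/4, W/4)
        B, D, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)       # (B, N, D) token sequence
        return self.norm(x), H, W

tokens, H, W = OverlapPatchEmbed()(torch.randn(1, 3, 224, 224))
```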

Convolutional Feed-Forward

A 3x3 depthwise conv is added between the two FC layers of the feed-forward network; see the sketch below.
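A minimal sketch of the convolutional feed-forward block, assuming PyTorch; module and parameter names are illustrative. Tokens are reshaped back into a feature map so the depthwise conv can run spatially.

```python
import torch
import torch.nn as nn

class ConvFFN(nn.Module):
    def __init__(self, dim=64, hidden=256):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        # depthwise: one 3x3 filter per channel, zero-padded to keep H x W
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, H, W):                     # x: (B, N, dim), N = H*W
        x = self.fc1(x)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)   # tokens -> feature map
        x = self.dwconv(x)
        x = x.flatten(2).transpose(1, 2)            # feature map -> tokens
        return self.fc2(self.act(x))
```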

Linear Spatial Reduction Attention

lol, at first I thought they had actually made attention itself linear. They just swapped the conv that was doing the spatial reduction for pooling.
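A minimal sketch of the idea, assuming PyTorch: keys and values are average-pooled to a fixed 7x7 grid, so the attention cost is O(N x 49) at any input resolution, i.e. linear in the number of tokens. Using nn.MultiheadAttention here is a simplification; the paper additionally passes the pooled map through a 1x1 conv, LayerNorm, and GELU, omitted here for brevity.

```python
import torch
import torch.nn as nn

class LinearSRA(nn.Module):
    def __init__(self, dim=64, num_heads=1, pool_size=7):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(pool_size)  # fixed-size K/V grid
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):                 # x: (B, N, dim), N = H*W
        B, N, C = x.shape
        kv = x.transpose(1, 2).reshape(B, C, H, W)
        kv = self.pool(kv).flatten(2).transpose(1, 2)  # (B, 49, dim)
        out, _ = self.attn(query=x, key=kv, value=kv)  # cost O(N * 49)
        return out
```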

Models


Results

Image classification


Object Detection


Instance Segmentation

COCO val2017
