[36] ViDT: An Efficient and Effective Fully Transformer-based Object Detector

Open dhkim0225 opened this issue 3 years ago • 0 comments

Let's build a cost-effective detector!

INTRO

As the figure shows, this paper proposes a detector that works well at small model sizes.

DETR (ViT) has two limitations:

- ViT has quadratic complexity with respect to image size
- Likewise, the attention operation itself is extremely expensive to compute
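
To make the quadratic growth concrete, here is a small back-of-the-envelope sketch. The function name `attention_cost` and the patch size of 16 are assumptions chosen for illustration, not values taken from the paper.

```python
def attention_cost(height, width, patch=16):
    """Return (num_tokens, attention_matrix_entries) for global self-attention.

    Assumes non-overlapping square patches; `patch=16` is an illustrative default.
    """
    tokens = (height // patch) * (width // patch)
    # Global self-attention compares every token with every other token.
    return tokens, tokens * tokens


for side in (224, 448, 896):
    tokens, entries = attention_cost(side, side)
    print(f"{side}x{side}: {tokens} tokens, {entries} attention entries")
```

Doubling the image side quadruples the token count and multiplies the size of the attention matrix by 16.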

There is also YOLOS, which appends [DET] tokens to the patch tokens ([PATCH]). It is ViT-based as well, and its [DET] tokens are learnable. YOLOS removes the neck entirely and is fairly efficient. However, without a neck it cannot use neck-side techniques such as multi-scale features or auxiliary losses.

The paper proposes ViDT, which puts the neck back. There are three contributions:

- An efficient attention mechanism, the Reconfigured Attention Module (RAM)
- An encoder-free neck design
- Token matching for efficient knowledge distillation

Preliminaries

ViT
DETR
YOLOS
- Becomes very slow as the model size grows
- Hard to apply additional neck techniques
- Performance gets worse with a DeiT backbone

ViDT: Vision and Detection Transformers

Reconfigured Attention Module (RAM)

It is hard to carry Swin's patch reduction and local attention over to detection as they are, for two reasons:

- The number of [DET] tokens must be kept at a fixed scale
- [DET] tokens have little locality among themselves

RAM is proposed to address this.

The single global attention of YOLOS is decomposed into three attentions.

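[PATCH] × [PATCH]
1. As they pass through the attention layers, [PATCH] tokens are progressively calibrated to capture the key contents of the global feature map.
2. Shifted window partitioning is performed, as in the original Swin Transformer.
3. The number of [PATCH] tokens shrinks by a factor of 4 at each stage (H and W are each halved). Over the four stages the resolution goes from H/4 × W/4 to H/32 × W/32.
[DET] × [DET]
1. As in YOLOS, 100 learnable [DET] tokens are added as extra input alongside the [PATCH] tokens.
2. The number of [DET] tokens determines how many objects can be detected, so it is kept at a fixed scale across the transformer layers. [DET] tokens have no locality, so global self-attention is used.
3. Each [DET] token learns its relation to the others, which helps localize different objects (roughly, duplicate removal).
[DET] × [PATCH]
1. Cross-attention between [DET] tokens and [PATCH] tokens.
2. Produces an object embedding for each [DET] token.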
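
The [PATCH] token reduction described under [PATCH] × [PATCH] can be checked with a few lines of arithmetic. This is only a sketch: the helper name `patch_grid_per_stage` and the 640 × 640 input are assumptions for illustration.

```python
def patch_grid_per_stage(height, width, num_stages=4):
    """Return the [PATCH] grid (rows, cols) at each stage.

    Assumes the first stage runs at stride 4 and every later stage
    halves H and W, as described in the notes above.
    """
    grids = []
    stride = 4
    for _ in range(num_stages):
        grids.append((height // stride, width // stride))
        stride *= 2
    return grids


for stage, (h, w) in enumerate(patch_grid_per_stage(640, 640), start=1):
    print(f"stage {stage}: {h} x {w} = {h * w} [PATCH] tokens")
```

Each stage has a quarter of the tokens of the previous one, while the number of [DET] tokens stays at 100 throughout.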

ViDT binds the [DET] × [DET] and [DET] × [PATCH] attention together and computes them in a single pass for efficiency. In effect, Swin's attention is replaced as shown in the figure above.
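
The three attentions can be sketched roughly as follows. This is a heavily simplified illustration, not the authors' implementation: it assumes a single head, no learned Q/K/V projections, no positional encodings, and plain non-overlapping windows without the shift. The names `attend` and `ram_sketch` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(q, k, v):
    """Scaled dot-product attention (single head, no projections)."""
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def ram_sketch(patch, det, window=4):
    """patch: (H, W, C) grid of [PATCH] tokens, det: (N, C) [DET] tokens."""
    H, W, C = patch.shape
    # [PATCH] x [PATCH]: local attention inside non-overlapping windows.
    wins = patch.reshape(H // window, window, W // window, window, C)
    wins = wins.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, C)
    wins = attend(wins, wins, wins)
    new_patch = wins.reshape(H // window, W // window, window, window, C)
    new_patch = new_patch.transpose(0, 2, 1, 3, 4).reshape(H, W, C)
    # [DET] x [DET] and [DET] x [PATCH] bound together: the [DET] queries
    # attend over the concatenation of [DET] and all [PATCH] tokens at once.
    keys_values = np.concatenate([det, patch.reshape(-1, C)], axis=0)
    new_det = attend(det, keys_values, keys_values)
    return new_patch, new_det


rng = np.random.default_rng(0)
new_patch, new_det = ram_sketch(rng.normal(size=(8, 8, 16)),
                                rng.normal(size=(100, 16)))
print(new_patch.shape, new_det.shape)
```

Note that the [PATCH] tokens never attend to [DET] tokens in this sketch, so the patch branch stays a plain windowed attention while only the 100 [DET] queries pay for global attention.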

Encoder-Free Neck Structure

Using an encoder in the neck is costly because of its [PATCH] × [PATCH] attention. ViDT's neck therefore drops the encoder and uses a multi-layer deformable transformer. It takes two inputs:

- The [PATCH] tokens produced at each Swin stage
- The [DET] tokens from the last Swin stage

First, [DET] × [DET] attention is performed. Then multi-scale deformable attention is applied to produce new [DET] tokens. The notation is:

- M: number of attention heads
- L: number of multi-scale feature maps (4)
- K: number of sampling points
- φ_l(p): reference point
- ∆p_{mlk}: sampling offset
- A_{mlk}: attention weights
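
The equation itself is not reproduced in these notes. Based on the symbols listed above, it presumably takes the multi-scale deformable attention form from Deformable DETR. In the reconstruction below, the projection matrices W_m and W'_m and the feature maps x^l are assumptions borrowed from that formulation, since they do not appear in the symbol list.

```latex
\mathrm{MSDeformAttn}\big([\mathtt{DET}], \{x^{l}\}_{l=1}^{L}\big)
  = \sum_{m=1}^{M} W_m \left[ \sum_{l=1}^{L} \sum_{k=1}^{K}
      A_{mlk} \cdot W'_m \, x^{l}\big(\phi_l(p) + \Delta p_{mlk}\big) \right]
```

Each [DET] token samples only K points per feature level and head around its reference point, so the cost no longer depends on the full number of [PATCH] tokens.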

Auxiliary tasks

Auxiliary decoding loss
1. Adopted as is from DETR.
2. Two FFNs (box regression and classification) are attached to every decoding layer.
Iterative Box Refinement
1. Adopted as is from Deformable DETR.
2. Each decoding layer refines the output of the previous decoding layer.
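
Iterative box refinement can be sketched as follows, assuming the Deformable DETR convention of adding each layer's predicted offset in inverse-sigmoid space. The function names and the use of precomputed offsets in place of real FFN outputs are assumptions for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def inverse_sigmoid(x, eps=1e-5):
    x = np.clip(x, eps, 1.0 - eps)
    return np.log(x / (1.0 - x))


def refine_boxes(init_boxes, layer_deltas):
    """Refine normalized (cx, cy, w, h) boxes layer by layer.

    init_boxes: (N, 4) array with values in (0, 1).
    layer_deltas: list of (N, 4) offsets, one per decoding layer
        (stand-ins for what each layer's box FFN would predict).
    Returns the refined boxes after every layer; with an auxiliary
    decoding loss, each of these would get its own loss term.
    """
    boxes = init_boxes
    per_layer = []
    for delta in layer_deltas:
        # Each layer corrects the previous layer's boxes instead of
        # predicting boxes from scratch.
        boxes = sigmoid(inverse_sigmoid(boxes) + delta)
        per_layer.append(boxes)
    return per_layer


init = np.full((100, 4), 0.5)
deltas = [np.full((100, 4), 0.1) for _ in range(6)]
outputs = refine_boxes(init, deltas)
print(len(outputs), outputs[-1][0])
```

Working in inverse-sigmoid space keeps every refined box inside the normalized (0, 1) range regardless of the offsets.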

Knowledge Distillation with Token Matching for Object Detection

All ViDT models have the same number of [PATCH] and [DET] tokens, which makes distillation natural. The distillation loss l_{dis} is built from these tokens, where P is the set of [PATCH] tokens and D is the set of [DET] tokens.
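
The loss equation is not reproduced in these notes. A minimal sketch of token matching, assuming the loss is the mean L2 distance between matched student and teacher tokens in P and D scaled by a coefficient, could look like this. The function name, the `coeff` parameter, and the assumption that student and teacher tokens share the same channel dimension are all illustrative.

```python
import numpy as np


def token_matching_loss(p_student, p_teacher, d_student, d_teacher, coeff=1.0):
    """Assumed form of l_dis: mean per-token L2 distance on P and on D.

    p_*: (num_patch_tokens, C) [PATCH] tokens of student / teacher.
    d_*: (num_det_tokens, C) [DET] tokens of student / teacher.
    Tokens are matched by index, which works because every ViDT model
    has the same number of [PATCH] and [DET] tokens.
    """
    patch_term = np.linalg.norm(p_student - p_teacher, axis=-1).mean()
    det_term = np.linalg.norm(d_student - d_teacher, axis=-1).mean()
    return coeff * (patch_term + det_term)


rng = np.random.default_rng(0)
p_t, d_t = rng.normal(size=(400, 256)), rng.normal(size=(100, 256))
p_s, d_s = p_t + 0.1 * rng.normal(size=p_t.shape), d_t.copy()
print(token_matching_loss(p_s, p_t, d_s, d_t))
```

Because matching is purely index-based, no assignment step between student and teacher tokens is needed.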