1day_1paper [2] Deformable DETR: Deformable Transformers for End-to-End Object Detection

Open dhkim0225 opened this issue 3 years ago • 0 comments

Back after two days of vaccination leave (21.10.05 - 21.10.06). If this post is hard to follow, read #28 first.

Deformable DETR

paper code

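DETR removes hand-designed components and still performs well, but it has two drawbacks:

Slow convergence
Limited feature spatial resolution (attention cost grows quadratically with the number of pixels)

Deformable DETR is proposed to address both:

Efficient attention over a small set of sampled keys
Use of multi-scale features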
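
To make the quadratic-cost point concrete, here is a minimal sketch that counts query-key pairs for dense attention over an H x W feature map versus attention over a fixed number of sampled keys per query. The function names and the 32x32 / 64x64 map sizes are assumptions chosen for illustration, not values from the paper; heads and channel dimensions are ignored.

```python
def dense_attention_pairs(h: int, w: int) -> int:
    """Query-key pairs when every pixel attends to every pixel."""
    n = h * w
    return n * n


def sampled_attention_pairs(h: int, w: int, k: int = 4) -> int:
    """Query-key pairs when every pixel attends to only k sampled keys."""
    return h * w * k


if __name__ == "__main__":
    for h, w in [(32, 32), (64, 64)]:
        dense = dense_attention_pairs(h, w)
        sampled = sampled_attention_pairs(h, w)
        print(f"{h}x{w}: dense={dense:,} sampled={sampled:,}")
```

Doubling each side of the feature map multiplies the dense count by 16 but the sampled count by only 4, which is why higher-resolution features become affordable.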

Deformable Attention Module

Let's start by understanding deformable attention.

x: input feature map (C×H×W)
q: query index
p_q: 2D reference point
z_q: content feature of query q
M: number of attention heads
K: total number of sampled keys

The defaults are M = 8 and K = 4.
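
For reference, the single-scale deformable attention these symbols belong to can be written as below. This is restated from the paper's definition as I recall it, so treat it as a reading aid; the symbols W_m, W'_m (learnable per-head projections), Δp_mqk (sampling offset), and A_mqk (attention weight) do not appear elsewhere in this post.

```latex
\mathrm{DeformAttn}(z_q, p_q, x)
  = \sum_{m=1}^{M} W_m \Big[ \sum_{k=1}^{K} A_{mqk} \cdot W'_m \, x\big(p_q + \Delta p_{mqk}\big) \Big],
\qquad \sum_{k=1}^{K} A_{mqk} = 1
```

Because p_q + Δp_mqk is generally fractional, x(·) is evaluated with bilinear interpolation.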

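[Figure: the deformable attention module]

The query feature z_q is passed through one Linear layer to obtain the sampling offsets, and through another Linear layer to obtain the attention weights. Because the attention weights are computed from the query alone, the approach can be seen as resembling Lambda Networks or the SE module.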
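
The flow described above can be sketched for a single head in plain Python. This is a toy sketch under assumptions, not the official implementation: it uses nested lists instead of tensors, leaves out the value and output projections, treats offsets as pixel units, and all function and parameter names (`deformable_attention_head`, `w_off`, `w_attn`, ...) are made up for illustration.

```python
import math
import random


def linear(weights, bias, vec):
    """Plain dense layer: weights is a list of rows, returns W @ vec + b."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def bilinear_sample(feature, x, y):
    """Sample a C x H x W nested-list map at fractional (x, y); outside the map counts as zero."""
    channels, height, width = len(feature), len(feature[0]), len(feature[0][0])
    x0, y0 = math.floor(x), math.floor(y)
    out = [0.0] * channels
    for yi, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= xi < width and 0 <= yi < height:
                for c in range(channels):
                    out[c] += wx * wy * feature[c][yi][xi]
    return out


def deformable_attention_head(z_q, p_q, feature, w_off, b_off, w_attn, b_attn, num_keys=4):
    """One head: offsets and attention weights both come from the query feature z_q only."""
    offsets = linear(w_off, b_off, z_q)             # 2 * num_keys values: (dx, dy) per key
    weights = softmax(linear(w_attn, b_attn, z_q))  # num_keys weights that sum to 1
    out = [0.0] * len(feature)
    for k in range(num_keys):
        dx, dy = offsets[2 * k], offsets[2 * k + 1]
        sampled = bilinear_sample(feature, p_q[0] + dx, p_q[1] + dy)
        out = [o + weights[k] * s for o, s in zip(out, sampled)]
    return out, weights


if __name__ == "__main__":
    rng = random.Random(0)
    C, H, W, K = 3, 5, 5, 4
    feature = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
    z_q = [rng.uniform(-1, 1) for _ in range(C)]
    w_off = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(2 * K)]
    w_attn = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(K)]
    out, weights = deformable_attention_head(
        z_q, (2.0, 2.0), feature, w_off, [0.0] * (2 * K), w_attn, [0.0] * K, K
    )
    print(out, weights)
```

Note that no key features enter the weight computation: the keys only contribute the values sampled at the predicted locations.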

This can be extended a little further to multi-scale features.

L: number of feature levels
φ: function that re-scales the normalized point \hat{p} back to the coordinates of the original feature map
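
For reference, the multi-scale form can be written as below, restated from the paper's definition as I recall it. Here \hat{p}_q is the reference point normalized to [0, 1]^2, x^l is the feature map of level l, and φ_l re-scales \hat{p}_q to that level; W_m, W'_m, Δp_mlqk, and A_mlqk are symbols from the paper rather than from this post.

```latex
\mathrm{MSDeformAttn}\big(z_q, \hat{p}_q, \{x^l\}_{l=1}^{L}\big)
  = \sum_{m=1}^{M} W_m \Big[ \sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} \cdot W'_m \, x^l\big(\phi_l(\hat{p}_q) + \Delta p_{mlqk}\big) \Big],
\qquad \sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} = 1
```

In this form K counts the sampling points per level, so each head samples L·K points in total.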

Once this is understood, the rest is easy to follow.

Encoder

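Four feature maps are used as encoder input: C3, C4, and C5 from ResNet, plus C6, which is obtained by a stride-2 3×3 convolution on C5.
The encoder's input and output feature maps have the same resolution.
In the encoder, the reference point of each query pixel is the pixel itself.
A scale-level embedding is added so that each query pixel clearly signals which feature level it lies on.
The scale-level embedding is randomly initialized and trained jointly with the network.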
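
A rough sketch of how the encoder input could be assembled: every pixel of every level becomes a token, its reference point is its own normalized position, and one embedding vector per level is added to all tokens of that level. The names (`build_encoder_input`, `reference_points`), the pixel-centre convention `(col + 0.5) / width`, and the random stand-in features are assumptions for illustration only.

```python
import random


def reference_points(height, width):
    """Normalized (x, y) centre of every pixel, in row-major order."""
    return [((col + 0.5) / width, (row + 0.5) / height)
            for row in range(height) for col in range(width)]


def add_level_embedding(tokens, level_embedding):
    """Add the same per-level vector to every token of that level."""
    return [[t + e for t, e in zip(token, level_embedding)] for token in tokens]


def build_encoder_input(level_shapes, dim, rng):
    """Flatten all levels into one token sequence with per-token reference points."""
    tokens, refs, level_ids = [], [], []
    for level, (height, width) in enumerate(level_shapes):
        # Stand-in for the backbone feature at this level (random placeholder values).
        feats = [[rng.random() for _ in range(dim)] for _ in range(height * width)]
        # Randomly initialized here; during training it would be a learned parameter.
        embedding = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        tokens += add_level_embedding(feats, embedding)
        refs += reference_points(height, width)
        level_ids += [level] * (height * width)
    return tokens, refs, level_ids


if __name__ == "__main__":
    shapes = [(8, 8), (4, 4), (2, 2), (1, 1)]  # hypothetical sizes for four levels
    tokens, refs, level_ids = build_encoder_input(shapes, dim=16, rng=random.Random(0))
    print(len(tokens), refs[0], level_ids[-1])
```

Without the level embedding, two tokens at the same normalized position on different levels would be indistinguishable once the levels are flattened into one sequence.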

Decoder

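The self-attention part of the decoder is unchanged.
The proposed module is applied as-is to the cross-attention part.
The reference point \hat{p} is obtained by passing the object-query embedding through a linear layer followed by a sigmoid, so it lies in the range 0 to 1.
The reference point is re-scaled back to the feature map size before the sampling offsets are applied.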
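
The reference-point step can be sketched as follows. This is a minimal illustration assuming a 2-output linear layer with hand-written weights; `predict_reference_point` and `rescale_to_level` are hypothetical names, and the rescaling simply multiplies by the map width and height.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def predict_reference_point(query_embedding, weight, bias):
    """Linear layer with 2 outputs followed by sigmoid: normalized (x, y) in (0, 1)."""
    logits = [sum(w * q for w, q in zip(row, query_embedding)) + b
              for row, b in zip(weight, bias)]
    return tuple(sigmoid(v) for v in logits)


def rescale_to_level(ref_point, height, width):
    """Map a normalized point to the coordinates of an H x W feature map."""
    return (ref_point[0] * width, ref_point[1] * height)


if __name__ == "__main__":
    query = [0.2, -0.4, 0.1]                        # toy object-query embedding
    weight = [[0.5, 0.1, -0.3], [0.0, 0.7, 0.2]]    # toy 2 x 3 linear weights
    ref = predict_reference_point(query, weight, [0.0, 0.0])
    print(ref, rescale_to_level(ref, height=25, width=38))
```

The same normalized point can be rescaled to every feature level, which is what lets one object query sample from all scales at once.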

Additional Techniques

Two additional techniques were added to improve performance.

Iterative BBOX Refinement

Each decoder layer refines the box output of the previous layer.
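
A minimal sketch of this refinement, assuming boxes are normalized (cx, cy, w, h) tuples and each layer predicts a delta that is added in logit (inverse-sigmoid) space, which is how I understand the paper's update rule. The function names and the example deltas are made up for illustration.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def inverse_sigmoid(value, eps=1e-5):
    """Logit with clamping so inputs at 0 or 1 stay finite."""
    value = min(max(value, eps), 1.0 - eps)
    return math.log(value / (1.0 - value))


def refine_box(prev_box, delta):
    """One refinement step: add the predicted delta in logit space, then squash back."""
    return tuple(sigmoid(d + inverse_sigmoid(b)) for b, d in zip(prev_box, delta))


def iterative_refinement(initial_box, deltas_per_layer):
    """Apply refine_box once per decoder layer and keep every intermediate box."""
    boxes = [initial_box]
    for delta in deltas_per_layer:
        boxes.append(refine_box(boxes[-1], delta))
    return boxes


if __name__ == "__main__":
    start = (0.5, 0.5, 0.2, 0.2)
    deltas = [(0.3, -0.1, 0.2, 0.0), (0.05, 0.0, -0.1, 0.1)]  # toy per-layer predictions
    for box in iterative_refinement(start, deltas):
        print(tuple(round(v, 4) for v in box))
```

Working in logit space keeps every refined coordinate inside (0, 1) regardless of how large the predicted delta is.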

Two-Stage Deformable DETR

The encoder outputs region proposals.
Each feature point acts as an object query and predicts a region proposal.
The proposals are passed directly to the second stage, without applying NMS.
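
The hand-off between stages can be sketched as a plain top-k selection by score. This assumes, as I recall from the paper, that the highest-scoring encoder predictions are kept as proposals; `select_proposals`, the toy scores, and the value of `top_k` are illustrative assumptions.

```python
def select_proposals(scores, boxes, top_k):
    """Keep the top_k highest-scoring boxes; overlapping boxes are not suppressed (no NMS)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [(scores[i], boxes[i]) for i in order]


if __name__ == "__main__":
    scores = [0.10, 0.92, 0.55, 0.90]
    boxes = [
        (0.2, 0.2, 0.1, 0.1),
        (0.5, 0.5, 0.3, 0.3),
        (0.8, 0.3, 0.2, 0.2),
        (0.51, 0.5, 0.3, 0.3),  # nearly identical to the second box, still kept
    ]
    for score, box in select_proposals(scores, boxes, top_k=2):
        print(score, box)
```

In the example the two near-duplicate boxes are both selected, which is the behaviour NMS would normally prevent.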

Results

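Unfortunately, it is slower.

FPN was also tried, but it is not really needed in this architecture. For domains such as OCR, where multi-scale issues matter a lot, it seems worth running a separate FPN experiment.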

Oct 07 '21 05:10 dhkim0225
