1day_1paper [4] Fast Convergence of DETR with Spatially Modulated Co-Attention (SMCA)

Open dhkim0225 opened this issue 3 years ago • 0 comments

The paper proposes SMCA (Spatially Modulated Co-Attention) to speed up DETR convergence. It reaches 45.6 AP on COCO 2017 val at 108 epochs; given that Deformable DETR reached 44.9, SMCA can be said to perform well. That said, DETR got rid of all the hand-crafted components, and hand-crafted features seem to be gradually creeping back in. ~Deformable DETR is the best~

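It looks like the technique used in Deformable DETR (reducing the cost of self-attention) could be layered on top of SMCA, and I'm curious what would happen if the two were combined. (What if the deformable approach replaced the intra-scale and multi-scale self-attention?)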

SMCA

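First, the paper focuses on the fact that DETR's queries are hard to converge, and adds an extra operation that predicts the object center. In the figure, this corresponds to extracting the spatial prior for the orange region. O_q is the object query, and the MLP is a 2-layer FC. From these a 2D Gaussian is produced (values in 0~1); beta is a hyperparameter. Take the log of the resulting Gaussian weight, use it when computing the attention, and that's it!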
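The spatial prior described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `gaussian_prior` and `modulated_attention`, the pixel-unit center, and the exact form of the exponent (squared distance divided by beta times squared scale) are my own choices, and the MLP that would predict the center and scale from O_q is replaced by hand-picked numbers.

```python
import numpy as np

def gaussian_prior(center, scale, height, width, beta=1.0):
    """Gaussian-like weight map with values in (0, 1] over an H x W grid.

    center = (c_h, c_w) and scale = (s_h, s_w) are assumed to be in pixel units.
    """
    c_h, c_w = center
    s_h, s_w = scale
    rows = np.arange(height, dtype=np.float64)[:, None]  # shape (H, 1)
    cols = np.arange(width, dtype=np.float64)[None, :]   # shape (1, W)
    return np.exp(-((rows - c_h) ** 2) / (beta * s_h ** 2)
                  - ((cols - c_w) ** 2) / (beta * s_w ** 2))

def modulated_attention(logits, prior, eps=1e-12):
    """Softmax over all spatial positions of (logits + log prior)."""
    z = logits.ravel() + np.log(prior.ravel() + eps)  # eps avoids log(0)
    z = z - z.max()                                   # numerical stability
    w = np.exp(z)
    return (w / w.sum()).reshape(logits.shape)

# Hypothetical example: flat attention logits, prior centered at row 2, col 5.
prior = gaussian_prior(center=(2.0, 5.0), scale=(1.5, 1.5), height=8, width=8)
attn = modulated_attention(np.zeros((8, 8)), prior)
print(np.unravel_index(attn.argmax(), attn.shape))
```

Adding the log of the prior to the logits is equivalent to multiplying the unnormalized attention weights by the prior, so positions far from the predicted center are suppressed before normalization.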

Multi-Head

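The authors also added an extra operation per head so that each head can pick out a different center. The model predicts a per-head offset from the shared center, and a different Gaussian is applied to each head.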
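A rough sketch of the per-head variant, assuming each head adds its own offset to one shared center and has its own scale. The function name `per_head_priors`, the array layout, and the Gaussian form are assumptions for illustration; in the paper the offsets and scales are predicted from the object query rather than passed in by hand.

```python
import numpy as np

def per_head_priors(center, offsets, scales, height, width, beta=1.0):
    """Build one Gaussian-like map per attention head.

    center:  (2,)        shared (c_h, c_w)
    offsets: (heads, 2)  per-head (delta_h, delta_w)
    scales:  (heads, 2)  per-head (s_h, s_w)
    returns: (heads, H, W)
    """
    center = np.asarray(center, dtype=np.float64)
    offsets = np.asarray(offsets, dtype=np.float64)
    scales = np.asarray(scales, dtype=np.float64)
    centers = center[None, :] + offsets                        # (heads, 2)
    rows = np.arange(height, dtype=np.float64)[None, :, None]  # (1, H, 1)
    cols = np.arange(width, dtype=np.float64)[None, None, :]   # (1, 1, W)
    d_h = (rows - centers[:, 0, None, None]) ** 2 / (beta * scales[:, 0, None, None] ** 2)
    d_w = (cols - centers[:, 1, None, None]) ** 2 / (beta * scales[:, 1, None, None] ** 2)
    return np.exp(-d_h - d_w)                                  # (heads, H, W)

# Hypothetical example: two heads looking slightly above and to the right of the center.
priors = per_head_priors(center=(3.0, 3.0),
                         offsets=[(-1.0, 0.0), (0.0, 1.0)],
                         scales=[(1.0, 2.0), (2.0, 1.0)],
                         height=6, width=6)
print(priors.shape)
```

Each head's map would then be logged and added to that head's attention logits, so different heads can attend to different parts of the same object.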

Multi-Scale Features

On top of this, multi-scale inputs are added. Each input goes through an intra-scale self-attention module, so self-attention is computed separately for each feature scale. The reason is that self-attention across all scales at once is expensive.
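To make the cost argument concrete, here is a small back-of-the-envelope sketch that counts query-key pairs only. The function name `attention_pair_counts` and the token counts are hypothetical; real cost also depends on the channel dimension and the implementation.

```python
def attention_pair_counts(token_counts):
    """Return (intra_scale, joint) numbers of query-key pairs.

    intra_scale: each scale attends only to itself, so the cost is sum(n_j ** 2).
    joint: all scales are concatenated into one sequence, so the cost is (sum n_j) ** 2.
    """
    intra_scale = sum(n * n for n in token_counts)
    joint = sum(token_counts) ** 2
    return intra_scale, joint

# Hypothetical token counts for three feature maps (40x40, 20x20, 10x10).
intra, joint = attention_pair_counts([1600, 400, 100])
print(intra, joint, joint / intra)
```

The joint count always exceeds the intra-scale count by the cross-scale terms, which is the part that intra-scale self-attention skips.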

In the decoder, unlike FPN, scale-selection attention weights are generated to mix the features appropriately. These weights are also predicted from the object query.

In this case the keys and values gain one more dimension (the scale), and SMCA is applied per scale accordingly.
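One way to picture the multi-scale version is the sketch below, for a single query and a single head. It assumes that attention is normalized within each scale, that the log spatial prior is added to the logits of each scale, and that the per-scale outputs are summed using softmax-normalized scale-selection weights. The name `multi_scale_co_attention` and the argument layout are hypothetical, and the scale logits are passed in directly instead of being predicted from the object query.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_scale_co_attention(query, keys, values, log_priors, scale_logits):
    """Spatially modulated co-attention over several feature scales.

    query:        (d,)
    keys[j]:      (N_j, d)   flattened features of scale j
    values[j]:    (N_j, d)
    log_priors[j]:(N_j,)     log of the Gaussian weight at each position
    scale_logits: (num_scales,)
    returns: (output of shape (d,), scale weights of shape (num_scales,))
    """
    d = query.shape[0]
    alpha = softmax(np.asarray(scale_logits, dtype=np.float64))
    out = np.zeros(values[0].shape[1])
    for a, k, v, log_g in zip(alpha, keys, values, log_priors):
        attn = softmax(k @ query / np.sqrt(d) + log_g)  # (N_j,), sums to 1
        out += a * (attn @ v)                           # weight this scale by alpha_j
    return out, alpha
```

With this layout, a query for a large object can put most of its scale weight on the coarse feature map while a query for a small object favors the fine one, without a fixed FPN-style assignment.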

Results

Uses 300 object queries instead of 100, and uses Focal Loss.

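While reading the paper I was curious about the ablations. Predicting the Gaussian's height and width separately is the "independent" variant, and predicting them together is the "single" variant. On top of the independent variant, computing a separate Gaussian for each head helps performance.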

Next is the ablation on intra-scale and multi-scale self-attention. Unfortunately there is no comparison against the Deformable DETR approach. Sharing the self-attention weights turns out to be better.

Oct 10 '21 03:10 dhkim0225
