[42] Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers

Open dhkim0225 opened this issue 3 years ago • 0 comments

Pyramid Vision Transformer (PVT) v2 backbone (#69)
Multi-scale features are fused by a deformable attention encoder
A location decoder finds the location of each instance, and that information is then fed to the mask decoder as queries to produce the masks

Introduction

Methods

3.1. Overall Architecture

The backbone features are flattened and passed through the encoder (C3, C4, C5 are at 1/8, 1/16, 1/32 of the input resolution, respectively)
The location decoder takes (thing queries, output of (1)) as input and finds the bboxes
The mask decoder takes (stuff queries, output of (2), output of (1)) as input

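C3, C4, and C5 are each flattened and passed through one FC layer, with the channel size fixed at 256. This yields maps of size L1 x 256, L2 x 256, and L3 x 256. L_i is the number of spatial positions in the corresponding feature map: for an H x W input, L_i = (H / 2^{i+2}) x (W / 2^{i+2}).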
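As a quick sanity check on these sizes, the sketch below counts the flattened positions of C3, C4, and C5 for a given input resolution. The function name `token_counts` and the rounding up for sizes that are not divisible by the stride are assumptions made for illustration, not something taken from the paper or its code.

```python
def token_counts(height, width, strides=(8, 16, 32)):
    """Return [L1, L2, L3]: flattened positions of C3, C4, C5 (hypothetical helper)."""
    counts = []
    for stride in strides:
        # Ceiling division; rounding up for non-divisible sizes is an assumption.
        feat_h = -(-height // stride)
        feat_w = -(-width // stride)
        counts.append(feat_h * feat_w)
    return counts


if __name__ == "__main__":
    l1, l2, l3 = token_counts(800, 1216)
    # Each map is then projected to 256 channels, i.e. shapes (L_i, 256).
    print(l1, l2, l3, l1 + l2 + l3)
```

For an 800 x 1216 input this gives L1 = 15200, L2 = 3800, L3 = 950, so the encoder sees 19950 tokens in total.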

3.2. Transformer Encoder

Uses the encoder proposed in Deformable DETR.

3.3. Decoder

3.3.1 Query Decoupling Strategy

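The authors argue that packing thing queries and stuff queries into a single query set, as previous work does, is inefficient. I looked into it a bit: for COCO, in the label map below, the first 80 classes are things and the last 53 are stuff. https://github.com/zhiqi-li/Panoptic-SegFormer/blob/38b0b46f6e36e0dac7c49065a28fdbd03ff29e9b/easymd/models/utils/visual.py#L90-L113

There are 300 thing queries; the number of stuff queries depends on the dataset.

The thing queries (purple in the figure) are assigned targets by bipartite matching, while the stuff queries use fixed matching.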
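The difference between the two assignment schemes can be sketched as follows. This is a toy illustration only: `bipartite_match` brute-forces the minimum-cost assignment for a tiny cost matrix (real implementations would typically use a Hungarian solver such as `scipy.optimize.linear_sum_assignment`), and `fixed_match` assumes that stuff query k is permanently tied to stuff class k. Both function names and the cost values are hypothetical.

```python
from itertools import permutations


def bipartite_match(cost):
    """cost[q][g] = cost of assigning query q to ground truth g.

    Returns {query_index: gt_index} with minimum total cost.
    Brute force, so only suitable for tiny examples.
    """
    num_queries = len(cost)
    num_gts = len(cost[0]) if cost else 0
    best_total, best_queries = None, ()
    for queries in permutations(range(num_queries), num_gts):
        total = sum(cost[q][g] for g, q in enumerate(queries))
        if best_total is None or total < best_total:
            best_total, best_queries = total, queries
    return {q: g for g, q in enumerate(best_queries)}


def fixed_match(present_stuff_classes, num_stuff):
    """Stuff query k always targets stuff class k (assumed convention).

    Returns {query_index: class_id or None}; None means the class is absent
    from the image, so the query would be trained towards an empty mask.
    """
    return {k: (k if k in present_stuff_classes else None) for k in range(num_stuff)}


if __name__ == "__main__":
    # 3 thing queries, 2 ground-truth instances.
    cost = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
    print(bipartite_match(cost))
    print(fixed_match({2}, num_stuff=4))
```

With bipartite matching the query-to-target assignment changes from image to image, whereas with fixed matching it is known in advance and no matching cost has to be computed.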

3.3.2 Location Decoder

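The location decoder is identical to the Deformable DETR decoder. During training, an auxiliary MLP head is attached to the location decoder to additionally compute the detection loss L_{det}.

The location decoder does not have to produce boxes. The authors also tried predicting each object's center of mass, and it performed about as well as the box model.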
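To make the center-of-mass alternative concrete, here is a minimal sketch of how such a target could be derived from a binary ground-truth mask. How the paper actually builds and normalizes this target is not stated in these notes, so the helper `center_of_mass` and its pixel-coordinate output are assumptions.

```python
def center_of_mass(mask):
    """Return (cx, cy) in pixel coordinates for a binary 2D mask, or None if empty.

    `mask` is a list of rows; hypothetical helper for illustration.
    """
    total = 0.0
    sum_x = 0.0
    sum_y = 0.0
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            total += value
            sum_x += x * value
            sum_y += y * value
    if total == 0:
        return None  # no foreground pixels, so no center is defined
    return (sum_x / total, sum_y / total)


if __name__ == "__main__":
    mask = [
        [0, 1, 1],
        [0, 1, 1],
    ]
    print(center_of_mass(mask))
```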

3.3.3 Mask Decoder

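This corresponds to part (d) of the overall architecture figure. Since it has to predict segmentation masks, a vanilla transformer decoder is used as-is.

The thing queries are passed through the mask decoder and have to predict every "thing category". In other words, just predict a probability map as in ordinary segmentation: produce pixel-wise logits per category and apply Cross-Entropy.

The stuff queries only need to predict stuff. Since the number of stuff classes is fixed in advance (53 for COCO), that many queries are created and a probability map is produced for each of them.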

DETR is reported to converge slowly because of its attention, so two techniques are applied.

Use an ultra-light FC head (200 parameters)
Attention map supervision (the attention maps are supervised by the masks, which speeds up learning)

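Deep supervision on the attention maps using the masks? What does that mean... The attention map A has total size N×h×(L1+L2+L3). N is N_{thing} + N_{stuff}, which is 353 for COCO. h is the number of attention heads. L1, L2, L3 are explained in the encoder part above. The attention is split into (N×h×L1, N×h×L2, N×h×L3), the pieces are brought to a common size with bilinear resizing, and the result is passed through one FC layer that is trained to predict the masks.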
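The shape bookkeeping for this step can be sketched as below. It only tracks tensor shapes and performs no actual attention or resizing. The helper name, the choice of 8 heads in the example, and resizing everything to the largest (1/8) resolution are assumptions for illustration.

```python
def attention_split_shapes(n_thing, n_stuff, heads, height, width, strides=(8, 16, 32)):
    """Return (full_shape, per_scale_shapes, resized_shapes) for attention map A.

    Hypothetical helper: tracks shapes only, no tensors are created.
    """
    n = n_thing + n_stuff
    # Spatial size of each feature map (ceiling division is an assumption).
    sizes = [(-(-height // s), -(-width // s)) for s in strides]
    full_shape = (n, heads, sum(h * w for h, w in sizes))
    # Split along the last axis into L1, L2, L3 and fold each back to 2D.
    per_scale_shapes = [(n, heads, h, w) for h, w in sizes]
    # Assumed target: the largest map; every scale is resized to it.
    target_h, target_w = sizes[0]
    resized_shapes = [(n, heads, target_h, target_w) for _ in sizes]
    return full_shape, per_scale_shapes, resized_shapes


if __name__ == "__main__":
    full, per_scale, resized = attention_split_shapes(300, 53, 8, 800, 1216)
    print(full)
    print(per_scale)
    print(resized)
```

In a real implementation the resize step would be a bilinear interpolation over the last two axes, after which the per-scale maps share one spatial grid and can be fed to the light FC head pixel by pixel.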

3.4. Loss Function

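λ_{things} and λ_{stuff} are adjusted per image according to the proportion of things and stuff in it. They sum to 1.

λ_{cls}, λ_{seg}, λ_{det} are 2, 1, 1, respectively.
L_{det} is the Deformable DETR loss, attached to the location decoder.
L_{seg} is the dice loss.
L_{cls} is the focal loss.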
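A minimal sketch of how these pieces might combine is given below. The loss equation itself is not reproduced in these notes, so the structure here is an assumption: the things branch carries classification, segmentation, and detection terms, the stuff branch carries only classification and segmentation terms, and the two branches are mixed with λ_{things} and λ_{stuff} = 1 - λ_{things}. The dice variant with a smoothing constant `eps` is also an assumed form.

```python
def dice_loss(pred, target, eps=1.0):
    """Dice loss on flat lists of probabilities (pred) and 0/1 labels (target).

    The smoothing constant and the non-squared denominator are assumptions.
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def total_loss(things, stuff, lambda_things, w_cls=2.0, w_seg=1.0, w_det=1.0):
    """Combine per-branch losses (assumed structure, see lead-in).

    things: dict with keys "cls", "seg", "det"; stuff: dict with "cls", "seg".
    """
    lambda_stuff = 1.0 - lambda_things  # the two weights sum to 1
    loss_things = w_cls * things["cls"] + w_seg * things["seg"] + w_det * things["det"]
    loss_stuff = w_cls * stuff["cls"] + w_seg * stuff["seg"]
    return lambda_things * loss_things + lambda_stuff * loss_stuff


if __name__ == "__main__":
    print(dice_loss([1.0, 0.0, 1.0, 0.0], [1, 0, 1, 0]))
    things = {"cls": 0.5, "seg": 0.2, "det": 0.3}
    stuff = {"cls": 0.1, "seg": 0.4}
    print(total_loss(things, stuff, lambda_things=0.75))
```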

3.5. Mask-Wise Merging Inference

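Panoptic segmentation is about getting the right answer for every pixel, but assigning each pixel a single result via per-pixel argmax produced too many false positives.

The proposed method is as follows. First, compute a confidence score s. p_i is the class probability of the i-th result. m_i[h, w] is the mask logit at position [h, w]. α and β are parameters that weight the classification probability and the segmentation quality. As the formula shows, only positions whose mask probability is at least 0.5 are used.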
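The scoring step, followed by one plausible merging loop, can be sketched as below. The formula is not reproduced in these notes, so the sketch assumes s_i = p_i^α × (mean of m_i[h, w] over positions with m_i[h, w] ≥ 0.5)^β, with masks given as probabilities. The default values of `alpha` and `beta`, the thresholds `score_thresh` and `keep_thresh`, and the greedy highest-score-first merge are all assumptions for illustration.

```python
def confidence_score(class_prob, mask, alpha=1.0, beta=2.0):
    """Assumed form: p^alpha * mean(mask values >= 0.5)^beta."""
    foreground = [v for row in mask for v in row if v >= 0.5]
    if not foreground:
        return 0.0
    return (class_prob ** alpha) * ((sum(foreground) / len(foreground)) ** beta)


def mask_wise_merge(results, score_thresh=0.3, keep_thresh=0.6):
    """Greedy merge of (class_id, class_prob, mask) results into one canvas.

    Returns (canvas, kept): canvas[y][x] is an index into `kept` or -1 if
    unassigned; kept lists the class id of each accepted segment.
    """
    if not results:
        return [], []
    height, width = len(results[0][2]), len(results[0][2][0])
    canvas = [[-1] * width for _ in range(height)]
    kept = []
    ordered = sorted(results, key=lambda r: confidence_score(r[1], r[2]), reverse=True)
    for class_id, class_prob, mask in ordered:
        if confidence_score(class_prob, mask) < score_thresh:
            continue  # low-confidence result, dropped entirely
        candidate = [(y, x) for y in range(height) for x in range(width) if mask[y][x] >= 0.5]
        free = [(y, x) for y, x in candidate if canvas[y][x] == -1]
        # Drop masks that are empty or mostly covered by higher-scoring ones.
        if not candidate or len(free) / len(candidate) < keep_thresh:
            continue
        for y, x in free:
            canvas[y][x] = len(kept)
        kept.append(class_id)
    return canvas, kept


if __name__ == "__main__":
    results = [
        (17, 0.6, [[0.2, 0.8, 0.8]]),
        (3, 0.9, [[0.9, 0.9, 0.1]]),
    ]
    print(mask_wise_merge(results))
    print(mask_wise_merge(results, keep_thresh=0.5))
```

Because whole masks are accepted or rejected in score order, a weak result cannot win scattered pixels the way it can under per-pixel argmax.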