DAB-DETR
Why does modulating attention by w & h work?
I have a question about this line: https://github.com/IDEA-opensource/DAB-DETR/blob/main/models/DAB_DETR/transformer.py#L242
refHW_cond = self.ref_anchor_head(output).sigmoid() # nq, bs, 2
This line asks the model to learn the absolute values of w and h from 'output', but no supervision is applied to these predictions. Besides, the 'output' tensor is also used to predict the offset of the bbox (x, y, w, h).
So I am wondering whether the model can actually learn the width and height as expected?
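For context, the modulation I am asking about looks roughly like the sketch below. It is a minimal, self-contained paraphrase of the surrounding code in transformer.py, not the repo's exact implementation: the tensor contents, the toy sizes, and the simple nn.Sequential head are my own stand-ins.

import torch
import torch.nn as nn

d_model, nq, bs = 256, 300, 2

# Hypothetical stand-ins for the decoder's tensors (shapes follow the repo's comments).
output = torch.randn(nq, bs, d_model)            # decoder hidden states
obj_center = torch.rand(nq, bs, 4)               # anchor boxes (x, y, w, h), already in (0, 1)
query_sine_embed = torch.randn(nq, bs, d_model)  # sinusoidal embedding of the anchor center

# Small head predicting a "reference" w, h from the decoder output -- the line in question.
ref_anchor_head = nn.Sequential(
    nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 2)
)
refHW_cond = ref_anchor_head(output).sigmoid()   # nq, bs, 2

# Modulate the positional query: one half of the embedding is scaled by w_ref / w_anchor,
# the other half by h_ref / h_anchor, so the cross-attention map widens or shrinks with
# the predicted box size. No loss is attached to refHW_cond itself; its only gradient
# signal comes through this scaling of the attention's positional term.
query_sine_embed[..., d_model // 2:] *= (refHW_cond[..., 0] / obj_center[..., 2]).unsqueeze(-1)
query_sine_embed[..., :d_model // 2] *= (refHW_cond[..., 1] / obj_center[..., 3]).unsqueeze(-1)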
The results show that our models obtain performance gains with the modulated attention operation.