Question about the sequence of decoder operations
Hi, I have a question regarding the sequence of decoder operations in your config file.
Based on your code, I believe the sequence of operations is: self_attn -> cross_attn -> ffn -> MultiScaleDeformableAttention. However, from reading the paper, my understanding was that the sequence should be: MultiheadSelfAttention -> MultiScaleDeformableAttention -> ffn.
https://github.com/Sense-X/Co-DETR/blob/2d59a3038533d00732275a0f5d31cf5ff0b540ad/projects/configs/co_deformable_detr/co_deformable_detr_r50_1x_coco.py#L68C1-L89C56
Could you clarify whether I have misunderstood the paper or the code?
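We follow the decoder design of Deformable-DETR, where the decoder operation order is: self_attn -> cross_attn -> ffn. Specifically, self_attn is implemented by MultiheadSelfAttention and cross_attn by MultiScaleDeformableAttention. The config therefore matches the order described in the paper (MultiheadSelfAttention -> MultiScaleDeformableAttention -> ffn); the deformable attention is the cross_attn step itself, not an extra operation after the ffn.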
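To illustrate how the operation names relate to the attention modules, here is a minimal sketch in plain Python. It does not import mmcv or mmdet; the `operation_order` and `attn_cfgs` values are simplified assumptions modelled on the linked config (check the file for the exact entries), and `resolve_order` is a hypothetical helper that pairs each attention step with the next attention config in list order, which is how we understand the transformer layer to consume them.

```python
# Simplified, assumed stand-ins for the decoder layer settings in the config.
# The real config entries carry more fields (embed_dims, dropout, ...).
operation_order = ("self_attn", "norm", "cross_attn", "norm", "ffn", "norm")
attn_cfgs = [
    {"type": "MultiheadAttention"},             # consumed by the first attention step
    {"type": "MultiScaleDeformableAttention"},  # consumed by the second attention step
]


def resolve_order(operation_order, attn_cfgs):
    """Hypothetical helper: pair each attention step with its module type.

    Attention configs are consumed in list order, one per 'self_attn' or
    'cross_attn' entry. Non-attention steps are paired with None.
    """
    resolved = []
    attn_index = 0
    for op in operation_order:
        if op in ("self_attn", "cross_attn"):
            resolved.append((op, attn_cfgs[attn_index]["type"]))
            attn_index += 1
        else:
            resolved.append((op, None))
    return resolved


resolved = resolve_order(operation_order, attn_cfgs)
for op, module in resolved:
    print(op, "->", module if module else "(no attention module)")
```

Under these assumptions, `attn_cfgs` lists the modules used by the attention steps rather than additional operations, so the second entry is what `cross_attn` runs, not a step after `ffn`.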