SparseR-CNN
Why does the network need proposal features/boxes?
Thanks for your great work! I have a few questions. As described in the paper, the proposal feature is a sparse representation used to obtain objects from the feature map: it generates the parameters of the dynamic convolution, and the dynamic convolution outputs the classification and regression results.
I'm wondering: if the proposal feature is an embedding used mainly to generate convolution parameters, why not use multi-branch convolutions directly? In my opinion, 100 conv branches are equivalent to 100 dynamic convolutions with proposal features.
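The claimed equivalence can be sketched as follows. This is a hedged NumPy illustration with made-up dimensions, not the actual Sparse R-CNN implementation: a "dynamic" 1x1 conv whose weights come from a proposal feature via a (hypothetical) linear parameter generator, next to a "multi-branch" conv with its own static weights.

```python
import numpy as np

# Illustrative sketch only (dimensions invented for clarity, not taken from
# the paper or the code): compare a dynamic 1x1 conv, whose weights are
# generated from a proposal feature, with a static per-branch 1x1 conv.
d_model, d_hidden, n_pixels = 8, 4, 49  # feature dim, bottleneck dim, 7x7 RoI

rng = np.random.default_rng(0)
proposal_feature = rng.standard_normal(d_model)         # one learned embedding
roi_feature = rng.standard_normal((n_pixels, d_model))  # pooled RoI feature

# Hypothetical parameter generator: a linear map from the proposal feature
# to the flattened weights of a 1x1 conv (d_model -> d_hidden).
W_gen = rng.standard_normal((d_model * d_hidden, d_model))
dynamic_w = (W_gen @ proposal_feature).reshape(d_model, d_hidden)

# "Dynamic conv": weights are instance-specific, generator is shared.
dynamic_out = roi_feature @ dynamic_w  # (n_pixels, d_hidden)

# "Multi-branch conv": one fixed weight matrix per branch, no generator.
static_w = rng.standard_normal((d_model, d_hidden))
static_out = roi_feature @ static_w    # same shape, but parameters are static

print(dynamic_out.shape, static_out.shape)  # (49, 4) (49, 4)
```

With a fixed proposal feature per branch, the two forms compute the same kind of linear map over the RoI feature, which is the intuition behind the question; whether they train to the same result is exactly what the experiment would have to show.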
Moreover, the proposal boxes also seem unnecessary. Since the paper mentions iterative refinement, and the feature already carries position information (coord. conv.), why not directly use the whole image as the boxes, which is the initialization used for the proposal boxes in this code? That is to say, all boxes would start from the whole image, without proposal boxes, and be processed directly by multi-branch convolutions. With several rounds of refinement, I think they could still be regressed to the correct locations. In this way, the external embeddings for proposal features and boxes would no longer be needed.
Any ideas?
Hi~ I think your idea is quite creative. It looks even simpler than Sparse R-CNN: both proposal features and boxes would be handled in a more elegant way. Have you carried out the experiment? How are the results?
Good idea, but I think the dynamic convs can be regarded as attention here: they can enhance foreground features and suppress those belonging to the background, which may have a different effect from naive multi-branch convs.
I think this is a really good idea, since what dynamic convolution does, as I understand it, is somewhat similar to a stack of two conv1d layers with stride 1. Are there any updates on the results of your idea?
Edit: I just found out that the proposal features are passed through self-attention, not the RoI features. That is completely different from just passing everything through parallel conv branches, since the weights of the convolutional layers interact with each other through self-attention. I wonder why the proposal features, rather than the RoI features, are passed through self-attention. Can anyone explain?
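The distinction can be sketched in a few lines. This is a hedged, minimal NumPy illustration (invented dimensions, single head, no scaling tricks beyond 1/sqrt(d)): self-attention runs across the set of proposal features, one token per proposal, so every generated conv weight depends on all the other proposals, which independent parallel branches cannot do.

```python
import numpy as np

# Minimal sketch: self-attention over N proposal features (one token per
# proposal). All dimensions are illustrative, not from the actual model.
n_proposals, d = 5, 8
rng = np.random.default_rng(1)
proposals = rng.standard_normal((n_proposals, d))  # stand-in proposal features

# Hypothetical projection matrices for queries, keys, and values.
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))

q, k, v = proposals @ Wq, proposals @ Wk, proposals @ Wv
scores = q @ k.T / np.sqrt(d)                         # (N, N) pairwise scores

# Row-wise softmax: each proposal attends to every proposal, itself included.
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)

out = attn @ v  # each output row mixes information from all proposals
print(out.shape)  # (5, 8)
```

Because `out[i]` is a weighted mix of all value rows, the conv parameters later generated from `out[i]` are no longer independent across proposals, unlike 100 parallel static branches.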