GCNet
When taking videos as input
When taking videos as input, the feature maps in each layer have four dimensions, i.e., T×H×W×C. Are the attention maps still query-independent? Could you please give more details? Thanks a lot.
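For anyone who wants to probe this themselves, below is a minimal sketch (not code from this repo) of embedded-Gaussian non-local attention on a T×H×W×C video feature map, with a rough query-independence check via pairwise cosine similarity of the per-query attention maps. The shapes and the `theta`/`phi` projections are illustrative assumptions.

```python
# A minimal sketch (not the GCNet repo's code) of probing
# query-(in)dependence of embedded-Gaussian non-local attention on a
# video feature map of shape (C, T, H, W). All shapes are hypothetical.
import torch
import torch.nn.functional as F

C, T, H, W = 64, 8, 14, 14
x = torch.randn(1, C, T, H, W)           # one video feature map

# 1x1x1 projections for query/key, as in a non-local block
theta = torch.nn.Conv3d(C, C // 2, kernel_size=1)
phi = torch.nn.Conv3d(C, C // 2, kernel_size=1)

q = theta(x).flatten(2).transpose(1, 2)  # (1, THW, C/2)
k = phi(x).flatten(2)                    # (1, C/2, THW)

attn = F.softmax(q @ k, dim=-1)          # (1, THW, THW): one THW-long
                                         # attention map per query position

# Query-independence check: how similar are the attention maps of
# different queries? Pairwise cosine similarity close to 1 everywhere
# would mean the maps are (nearly) query-independent.
maps = F.normalize(attn[0], dim=-1)      # (THW, THW), rows unit-normed
pairwise = maps @ maps.t()               # cosine similarity of all pairs
print("mean pairwise cosine similarity:", pairwise.mean().item())
```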
Up-voting this question. I am really interested in whether the attention maps in the video task show results similar to those in the object detection task.
I kind of think the temporal dimension should carry more importance than the spatial dimensions.
Sorry for the late reply.
The attention across time is relatively hard to visualize.
From Table 1 in the paper, the attention on Kinetics seems to be a little more query-dependent than on COCO.
We will leave it as future work.
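In case it helps, one rough way to summarize attention across time, rather than trying to visualize the full THW×THW map, is to pick a query, reshape its attention row back to (T, H, W), and sum over the spatial dimensions. This continues the tensors from the sketch above and is an illustrative assumption, not the paper's method.

```python
# Hedged sketch: per-frame attention mass for a single query position,
# reusing attn, T, H, W from the sketch above.
amap = attn[0, 0].reshape(T, H, W)       # attention map of query 0
per_frame = amap.sum(dim=(1, 2))         # attention mass on each frame
print("per-frame attention mass:", per_frame.tolist())
```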
In my experiments on video classification, the non-local module is not query-independent. What about your results?