GCNet
When taking videos as input
When taking videos as input, the feature maps in each layer have four dimensions, i.e., T×H×W×C. Are the attention maps still query-independent? Could you please give more details? Thanks a lot.
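For anyone who wants to probe this themselves, below is a minimal sketch (not code from this repo) of embedded-Gaussian non-local attention on a T×H×W×C video feature map, with a rough query-independence check via pairwise cosine similarity of the per-query attention maps. The shapes and the `theta`/`phi` projections are illustrative assumptions.

```python
# A minimal sketch (not the GCNet repo's code) of probing
# query-(in)dependence of embedded-Gaussian non-local attention on a
# video feature map of shape (C, T, H, W). All shapes are hypothetical.
import torch
import torch.nn.functional as F

C, T, H, W = 64, 8, 14, 14
x = torch.randn(1, C, T, H, W)           # one video feature map

# 1x1x1 projections for query/key, as in a non-local block
theta = torch.nn.Conv3d(C, C // 2, kernel_size=1)
phi = torch.nn.Conv3d(C, C // 2, kernel_size=1)

q = theta(x).flatten(2).transpose(1, 2)  # (1, THW, C/2)
k = phi(x).flatten(2)                    # (1, C/2, THW)

attn = F.softmax(q @ k, dim=-1)          # (1, THW, THW): one THW-long
                                         # attention map per query position

# Query-independence check: how similar are the attention maps of
# different queries? Pairwise cosine similarity close to 1 everywhere
# would mean the maps are (nearly) query-independent.
maps = F.normalize(attn[0], dim=-1)      # (THW, THW), rows unit-normed
pairwise = maps @ maps.t()               # cosine similarity of all pairs
print("mean pairwise cosine similarity:", pairwise.mean().item())
```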
Up-voting this question. I am really interested in whether the attention maps in the video task show results similar to those in the object detection task.
I kind of think the temporal dimension should carry more importance than the spatial dimensions.
Sorry for the late reply.
The attention across time is relatively hard to visualize.
From Table 1 in the paper, the attention on Kinetics seems to be a little more query-dependent than on COCO.
We will leave it as future work.
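In case it helps, one rough way to summarize attention across time, rather than trying to visualize the full THW×THW map, is to pick a query, reshape its attention row back to (T, H, W), and sum over the spatial dimensions. This continues the tensors from the sketch above and is an illustrative assumption, not the paper's method.

```python
# Hedged sketch: per-frame attention mass for a single query position,
# reusing attn, T, H, W from the sketch above.
amap = attn[0, 0].reshape(T, H, W)       # attention map of query 0
per_frame = amap.sum(dim=(1, 2))         # attention mass on each frame
print("per-frame attention mass:", per_frame.tolist())
```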
In my experiments on video classification, the non-local module is not query-independent. What about your results?