When taking videos as input

Open hshustc opened this issue 5 years ago • 3 comments

When taking videos as input, the feature maps in each layer have four dimensions, i.e., T×H×W×C. Are the attention maps still query-independent? Could you please give more details? Thanks a lot.

hshustc avatar Jun 13 '19 06:06 hshustc

Up-vote for this question. I am really interested in whether the attention maps in the video task show results similar to those in the object detection task.

I kind of think that the temporal dimension should matter more than the spatial dimensions.

tea1528 avatar Jul 01 '19 01:07 tea1528

Sorry for the late reply

Attention across time is relatively hard to visualize.

From Table 1 in the paper, the attention on Kinetics seems to be a little more query-dependent than on COCO.
We will leave a closer analysis as future work.
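
For anyone who wants to check this on their own video features, here is a rough sketch of one way to quantify query-dependence. It flattens a T×H×W×C feature map into T·H·W positions, computes one attention map per query position with plain dot-product similarity and softmax, and reports the mean pairwise cosine distance between those maps. Everything here is an assumption made for illustration: the function names are hypothetical, the features are random, there are no learned projections, and this is not necessarily the exact metric or setup used in the paper. A value near 0 means the maps are nearly identical across queries (query-independent); larger values mean more query-dependence.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def flatten_video(video):
    """Flatten video[t][h][w] (each a length-C list) into T*H*W feature vectors."""
    return [vec for frame in video for row in frame for vec in row]


def attention_maps(feats):
    """Return an N x N list; row i is the attention map of query position i.

    Assumption: similarity is a raw dot product between features,
    with no learned query/key projections.
    """
    return [
        softmax([sum(q_c * k_c for q_c, k_c in zip(q, k)) for k in feats])
        for q in feats
    ]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_pairwise_distance(maps):
    """Average cosine distance over all pairs of query positions."""
    n = len(maps)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += cosine_distance(maps[i], maps[j])
            count += 1
    return total / count


# Toy input: random features standing in for a real video feature map.
random.seed(0)
T, H, W, C = 2, 3, 3, 4
video = [
    [[[random.gauss(0, 1) for _ in range(C)] for _ in range(W)] for _ in range(H)]
    for _ in range(T)
]

feats = flatten_video(video)   # T*H*W positions, each of length C
maps = attention_maps(feats)   # one attention map per query position
score = mean_pairwise_distance(maps)
print(f"{len(feats)} positions, mean pairwise cosine distance = {score:.3f}")
```

To compare temporal and spatial behaviour, the same distance could be averaged separately over query pairs that differ only in `t` and pairs that differ only in `(h, w)`.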

xvjiarui avatar Jul 03 '19 04:07 xvjiarui

In my video classification experiments, the non-local module is not query-independent. What results did you get?

JJBOY avatar Jul 10 '19 07:07 JJBOY