pytorch-grad-cam

Video classification, Res+Transformer: Transformer input as feature vector, is it possible to visualize all 16 input frames?

Open · ziyuleoliu opened this issue 2 years ago · 1 comment

hey author,

thanks for your amazing work.

For ViT, we have to drop the class token and reshape the remaining 197 - 1 = 196 patch tokens to 14*14, so that we can treat them as a feature map of shape [14, 14, dim].
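For reference, the ViT reshape described here can be sketched roughly as below. This is a minimal illustration in plain PyTorch, assuming a 224x224 input with 16x16 patches (so 196 patch tokens plus one class token) and a hidden size of 768; the function name `reshape_transform` and the tensor sizes are illustrative assumptions, not something stated in this thread.

```python
import torch


def reshape_transform(tokens: torch.Tensor, height: int = 14, width: int = 14) -> torch.Tensor:
    """Turn ViT tokens [batch, 1 + height*width, dim] into a map [batch, dim, height, width]."""
    # Drop the class token at index 0, keeping only the patch tokens.
    patches = tokens[:, 1:, :]
    # Lay the patch tokens out on their spatial grid: [batch, height, width, dim].
    grid = patches.reshape(tokens.size(0), height, width, tokens.size(2))
    # Move channels first, the layout CAM-style methods usually expect.
    return grid.permute(0, 3, 1, 2)


# Assumed sizes: 2 images, 197 tokens, hidden size 768.
tokens = torch.randn(2, 197, 768)
feature_map = reshape_transform(tokens)
print(feature_map.shape)
```

The point of the reshape is only to restore the spatial layout of the tokens; no values are changed, so the resulting tensor can be treated like a convolutional feature map.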

However, I'm using a ResNet + vanilla Transformer for a video classification task. The output of the last conv layer of ResNet-18 is [16, 512, 7, 7] (16 frames represent one video), and the input to the Transformer is [16, 512]. I want to visualize the pixel-level contribution of all 16 frames to my Transformer network's prediction. In this case I only have feature vectors as the input to the Transformer. If I treat the 16 as 4*4, I can get only ... I'm wondering: when we don't have a feature map, is it possible to visualize the heatmap using Grad-CAM?

Thanks in advance. Best

ziyuleoliu · Mar 02 '23 23:03

Hi, can you clarify what you mean by not having a feature map? You still have access to the ResNet network, no? If so, and if you have a model that does all the steps in one forward pass (the ResNet + Transformer), you could target the feature maps from the ResNet.
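A minimal sketch of that idea, written with plain PyTorch hooks rather than the pytorch-grad-cam API, might look like the following. Everything here is an assumption made for illustration: the tiny conv stack stands in for ResNet-18 (and uses 64 channels instead of 512 to keep it fast), `FrameBackboneTransformer` and `grad_cam` are hypothetical names, and the frames are pooled to vectors with global average pooling before the Transformer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameBackboneTransformer(nn.Module):
    """Hypothetical stand-in: small conv backbone per frame + Transformer over frames."""

    def __init__(self, dim: int = 64, num_classes: int = 5):
        super().__init__()
        # Stand-in for ResNet-18; maps a 56x56 frame to a [dim, 7, 7] feature map.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, dim_feedforward=128, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        fmap = self.backbone(frames)                      # [T, dim, 7, 7]
        tokens = fmap.mean(dim=(2, 3))                    # [T, dim], one vector per frame
        encoded = self.transformer(tokens.unsqueeze(0))   # [1, T, dim]
        return self.head(encoded.mean(dim=1))             # [1, num_classes]


def grad_cam(model: FrameBackboneTransformer, frames: torch.Tensor, target_class: int) -> torch.Tensor:
    """Grad-CAM on the backbone feature maps; returns one heatmap per frame, [T, H, W]."""
    store = {}

    def forward_hook(module, inputs, output):
        store["acts"] = output
        # Capture the gradient of the class score w.r.t. the feature maps.
        output.register_hook(lambda grad: store.__setitem__("grads", grad))

    handle = model.backbone.register_forward_hook(forward_hook)
    try:
        model.zero_grad()
        logits = model(frames)
        logits[0, target_class].backward()
    finally:
        handle.remove()

    acts, grads = store["acts"].detach(), store["grads"]
    weights = grads.mean(dim=(2, 3), keepdim=True)        # [T, dim, 1, 1]
    cam = F.relu((weights * acts).sum(dim=1))             # [T, 7, 7]
    # Upsample each 7x7 map to the frame resolution.
    cam = F.interpolate(cam.unsqueeze(1), size=frames.shape[-2:],
                        mode="bilinear", align_corners=False).squeeze(1)
    # Normalise every frame's heatmap to [0, 1] independently.
    cam_min = cam.amin(dim=(1, 2), keepdim=True)
    cam_max = cam.amax(dim=(1, 2), keepdim=True)
    return (cam - cam_min) / (cam_max - cam_min + 1e-8)


torch.manual_seed(0)
model = FrameBackboneTransformer().eval()
video = torch.randn(16, 3, 56, 56)   # 16 frames of one (random) video
heatmaps = grad_cam(model, video, target_class=0)
print(heatmaps.shape)
```

Because the gradient flows from the Transformer's output back through the pooled vectors into each frame's 7x7 feature map, every one of the 16 frames gets its own heatmap. Normalising per frame, as done here, hides how important frames are relative to each other; normalising over the whole video instead would preserve that.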

jacobgil · Mar 15 '23 09:03