
Add support of Vision Transformer


🚀 Feature

I really appreciate your great work! However, I have a question: can Layer-CAM be used with a Vision Transformer network? If it does work, what would I need to change?

Motivation & pitch

I'm working on a project related to CAM.

Alternatives

No response

Additional context

No response

Yung-zi avatar Jan 22 '22 14:01 Yung-zi

Hello @Yung-zi 👋

My apologies, I've been busy with other projects lately! As of right now, the library is designed to work with CNNs. However, its design basically relies only on forward activation and backpropagated gradient hooks. So to answer your question, I'd need to run some tests, but if the output activation of a given layer has shape (N, C, H, W), then however it was computed, as long as this doesn't break backprop (i.e. it is differentiable), the library should work without much (perhaps any) change 😄

Either way, I intend to spend more time on Vision Transformer compatibility for the next release 👍 If you're interested in helping or providing feedback once it's in progress, let me know!
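
For reference, the hook-based CNN workflow looks roughly like this (a minimal sketch based on the documented usage; the `torchcam.methods` import path assumes a recent release, older ones exposed the extractors under `torchcam.cams`):

```python
import torch
from torchvision.models import resnet18
from torchcam.methods import LayerCAM

model = resnet18(pretrained=True).eval()
# Hook "layer4": a conv block whose activations have shape (N, C, H, W)
cam_extractor = LayerCAM(model, "layer4")

input_tensor = torch.rand(1, 3, 224, 224)  # stand-in for a preprocessed image
out = model(input_tensor)
# Retrieve the CAM for the top predicted class
cams = cam_extractor(out.squeeze(0).argmax().item(), out)
```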

frgfm avatar Feb 01 '22 23:02 frgfm

I'm so sorry for the late reply. I tried modifying your code earlier, but the results didn't look right; maybe I made some mistakes. Have you managed to get it working on a Vision Transformer?

Yung-zi avatar Jul 08 '22 08:07 Yung-zi

Partially, yes! But I have staged this for the next release anyway, so I'll dive into it to make it available :)

frgfm avatar Aug 02 '22 19:08 frgfm

Quick update! As of today, here is the support status of Torchvision transformer architectures:

  • [x] maxvit
  • [x] swin
  • [x] swin_v2
  • [ ] vit (so far I can't see a way to make this integration seamless, because of the class-token concatenation and the dimension swapping; see the sketch below)
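
To make the vit blocker concrete: torchvision's ViT blocks emit token sequences of shape (N, 1 + H*W, C), where H x W is the patch grid and a class token is concatenated in front of the patch tokens, instead of (N, C, H, W) feature maps. A hypothetical adapter, not part of the library, would have to undo that along these lines (the `tokens_to_spatial` name and the 14x14 grid for a 224px input with 16px patches are assumptions for illustration):

```python
import torch

def tokens_to_spatial(tokens: torch.Tensor, grid_size: int = 14) -> torch.Tensor:
    """Hypothetical adapter: (N, 1 + H*W, C) ViT tokens -> (N, C, H, W) map.

    Assumes the class token sits at index 0 (torchvision's convention) and a
    square patch grid, e.g. 14x14 for a 224px input with 16px patches.
    """
    patch_tokens = tokens[:, 1:, :]  # drop the class token
    n, num_patches, c = patch_tokens.shape
    assert num_patches == grid_size ** 2, "expected a square patch grid"
    return patch_tokens.transpose(1, 2).reshape(n, c, grid_size, grid_size)
```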

frgfm avatar Dec 31 '22 00:12 frgfm

Another update: ViT requires another method, called attention flow! I'll try to investigate & implement this, but it's a bit more complex than just inverting the axis swap & slicing.
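
For anyone curious, here is a minimal sketch of attention rollout, the simpler of the two techniques from "Quantifying Attention Flow in Transformers" (Abnar & Zuidema, 2020); the `attentions` input and its per-layer shape are assumptions, since torchvision's ViT doesn't expose attention maps out of the box:

```python
import torch

def attention_rollout(attentions):
    """Assumed input: one (N, num_heads, T, T) attention map per layer.

    Averages over heads, mixes in the identity to account for residual
    connections, then chains the per-layer matrices (Abnar & Zuidema, 2020).
    """
    rollout = None
    for attn in attentions:
        attn = attn.mean(dim=1)  # average over heads -> (N, T, T)
        eye = torch.eye(attn.size(-1), device=attn.device)
        attn = 0.5 * attn + 0.5 * eye  # account for the skip connection
        attn = attn / attn.sum(dim=-1, keepdim=True)  # re-normalize rows
        rollout = attn if rollout is None else attn @ rollout
    # Row 0: how much the class token ultimately draws from each patch token
    return rollout[:, 0, 1:]
```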

frgfm avatar Jan 02 '23 21:01 frgfm