pytorch-grad-cam
Strange! The last block doesn't give the right result, but the other blocks do.
Hi, when visualizing a ViT, I couldn't get a correct visualization with the last block as the target layer. The output gradient is shown in the figure below. Correct results can be obtained with the other blocks as the target layer. I used the class token as the classification feature. I hope you can help.


Which exact layer did you use?
The output of a ViT consists of the spatial tokens plus the cls token, and the classification is done on the cls token only. This means the spatial tokens at the output of the last layer are not connected to the classifier output, so their gradients are zero and they won't work as a target. When you go one layer back, the spatial tokens are connected to the output (through the cls token in the layer above, which attends to them).
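To make the connectivity argument concrete, here is a minimal dependency-free sketch. It is not the library's code or the model from this issue: the attention block has no learned weights, the head `head` is a hypothetical classifier that just sums the class token, and the gradient is estimated with a central finite difference rather than autograd. It only illustrates that the logit is insensitive to a patch token at the output of the last block, but sensitive to the same token one block earlier.

```python
import math

NUM_BLOCKS = 3  # assumption: three transformer blocks, as described in this thread


def attention_block(tokens):
    """Toy self-attention with a residual connection and no learned weights."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) for k in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        mixed = [sum(e / total * v[d] for e, v in zip(exps, tokens)) for d in range(dim)]
        out.append([x + m for x, m in zip(q, mixed)])
    return out


def head(tokens):
    """Hypothetical classifier head: reads only the class token (index 0)."""
    return sum(tokens[0])


def run_blocks(tokens, start, stop):
    """Apply blocks start..stop-1 to the token list."""
    for _ in range(start, stop):
        tokens = attention_block(tokens)
    return tokens


def sensitivity(x, block_idx, token_idx, eps=1e-4):
    """Finite-difference d(logit)/d(feature 0 of one token) at the output of block_idx."""
    acts = run_blocks(x, 0, block_idx + 1)
    logits = []
    for sign in (+1, -1):
        perturbed = [row[:] for row in acts]
        perturbed[token_idx][0] += sign * eps
        logits.append(head(run_blocks(perturbed, block_idx + 1, NUM_BLOCKS)))
    return (logits[0] - logits[1]) / (2 * eps)


# Token 0 plays the role of the cls token, tokens 1 and 2 are patch tokens.
x = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
last_block = sensitivity(x, block_idx=2, token_idx=1)
previous_block = sensitivity(x, block_idx=1, token_idx=1)
print("last block, patch token:", last_block)
print("previous block, patch token:", previous_block)
```

The first value is exactly zero because nothing after the last block reads the patch tokens. The second is nonzero because the cls token attends to the patch tokens in the block above.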

This is the network I used for testing. "last blocks" is just a transformer block, and the encoder has three transformer blocks. During training I added other modules before "last blocks". Thank you very much for your reply.