Notebook for attention map over input image
Attention heatmap visualization is a common utility that many researchers are likely to find useful.
Implementing it requires some subtle changes to core classes, changes that many researchers would probably prefer to have available out of the box.
Starting from a working implementation from here, I also figured out how to load the pre-trained models with registers ("Vision Transformers Need Registers"), which indeed resolves the curious artifacts that otherwise show up on some background attention tokens.
I've also cleaned up the code substantially, provided a simple example on a NASA space shuttle launch photo from Wikimedia Commons, and introduced a subtle visualization of the attention mask directly on top of the original image (a rough sketch of that overlay is below).
I hope this helps other researchers and developers!
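For a rough idea of what that overlay step looks like, here is a minimal, hypothetical sketch (not the exact notebook code): it assumes you already have the [CLS]-to-patch attention reshaped to the patch grid, and simply upsamples it to the image size before blending it with matplotlib. The inputs `image` and `attention_map` are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn.functional as F

# Hypothetical inputs:
#   image         - H x W x 3 numpy array with values in [0, 1]
#   attention_map - torch tensor of shape (grid_h, grid_w), e.g. the
#                   [CLS]-to-patch attention of one head (or a head average)
image = np.zeros((224, 224, 3))
attention_map = torch.rand(16, 16)

# Upsample the coarse patch-level attention to the full image resolution.
heatmap = F.interpolate(
    attention_map[None, None],     # (1, 1, grid_h, grid_w)
    size=image.shape[:2],          # (H, W)
    mode="bilinear",
    align_corners=False,
)[0, 0].cpu().numpy()

# Blend the heatmap on top of the original image; alpha keeps it subtle.
plt.imshow(image)
plt.imshow(heatmap, cmap="inferno", alpha=0.5)
plt.axis("off")
plt.show()
```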
This pull request addresses or resolves the following:
- https://github.com/facebookresearch/dinov2/issues/294
- https://github.com/facebookresearch/dinov2/issues/285
- https://github.com/facebookresearch/dinov2/issues/177
- https://github.com/facebookresearch/dinov2/issues/90
- https://github.com/facebookresearch/dinov2/issues/69
P.S. I haven't made many pull requests before, and I didn't want to mix this one up with https://github.com/facebookresearch/dinov2/pull/305, so I forked two separate repositories; in the future I'll just create branches for pull requests. Thanks!
Hi legel,
I get an error in attention.ipynb, in Cell [8]:
attentions = attentions[0, :, 0, 1+n_register_tokens:].reshape(number_of_head, -1)
IndexError: too many indices for tensor of dimension 3
So I checked the attention shape. It turns out to be [1, 4629, 768] instead of [1, 12, 4629, 4629] as in your notebook. I know 768 is the embedding dimension of the base model. Why do my attention results have a different shape from yours? Thank you.
Hi @LichunZhang, my best guess is that one of your core files did not get changed properly, so the model is still only feed-forwarding the 768-dimensional features instead of returning the full attention...
I would double-check that you've cloned the repository directly from https://github.com/3cology/dinov2_with_attention_extraction/tree/main and then run the notebook in that repository. Feel free to share the output here and any further insights.
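For reference, here is a small sketch of what that cell assumes about the attention tensor (the numbers are illustrative; the variable names match Cell [8]):

```python
import torch

# The notebook expects the full attention matrix from the last block:
#   (batch, heads, tokens, tokens), e.g. (1, 12, 4629, 4629),
# where tokens = 1 [CLS] token + n_register_tokens + the patch tokens.
# Smaller numbers are used here just to keep the example light.
number_of_head = 12
n_register_tokens = 4
num_tokens = 1 + n_register_tokens + 16       # 1 [CLS] + 4 registers + 16 patches
attentions = torch.rand(1, number_of_head, num_tokens, num_tokens)

# Cell [8] keeps only the [CLS] token's attention to the patch tokens,
# dropping the [CLS] and register columns, one row per head:
attentions = attentions[0, :, 0, 1 + n_register_tokens:].reshape(number_of_head, -1)
print(attentions.shape)  # torch.Size([12, 16])

# A (1, 4629, 768) tensor is not an attention matrix at all; it is the token
# representation itself (768 = embedding dim), which is why the indexing fails
# with "too many indices for tensor of dimension 3".
```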
Thank you for the quick response. I did clone the whole repository directly from https://github.com/3cology/dinov2_with_attention_extraction/tree/main, and attention.ipynb still returns attentions of shape [1, 4629, 768]. Could you please check the issue? Maybe some files in the main branch changed after your fork?
I think it happens because you are using the xFormers library, which uses MemEffAttention by default (https://github.com/3cology/dinov2_with_attention_extraction/blob/main/dinov2/layers/attention.py#L77) instead of the standard Attention (https://github.com/3cology/dinov2_with_attention_extraction/blob/main/dinov2/layers/attention.py#L36). Note that MemEffAttention does not return the attention matrix, as the Attention module does, but only the new representation for x.
Since the MemEffAttention module does not store the attention matrix at all (see https://github.com/facebookresearch/xformers/issues/730#issuecomment-1518740489), you need to use the Attention module to plot the saliency map.
It should work if at the beginning of the notebook you set something like os.environ["XFORMERS_DISABLED"] = '0'.
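One note on that environment variable, if I read dinov2/layers/attention.py correctly: it only checks whether XFORMERS_DISABLED is set at import time, not its value, so any value works, but it has to be set before anything from dinov2 is imported. A minimal sketch:

```python
import os

# dinov2's layers only check whether XFORMERS_DISABLED exists, not its value,
# so set it before importing anything from dinov2 (ideally in the first cell).
os.environ["XFORMERS_DISABLED"] = "1"

# Only import the model code afterwards, so the standard Attention module is
# used and the full (batch, heads, tokens, tokens) attention is available.
from dinov2.models.vision_transformer import vit_base  # one possible entry point
```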
Hi!
I stumbled on the same issue - the output of my attentions is a tensor with 3 dimensions: attentions.shape = torch.Size([1, 329, 384]).
I checked that this happens both with and without xformers (export XFORMERS_DISABLED=True).
Update: confirmed that it happens because xformers was enabled. I must have overlooked it before. Solved now :)
I solved the issue now. Refer to #90 and find ludles's answer; it turns out that we need to modify the code of MemEffAttention.
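I haven't copied ludles's exact change from #90 here, but a modification in that spirit computes the softmax attention explicitly (the way dinov2's standard Attention does) instead of relying on xFormers' memory_efficient_attention, which never materializes the attention matrix. A rough, self-contained sketch of that idea, with hypothetical class and variable names:

```python
import torch
import torch.nn as nn


class AttentionWithWeights(nn.Module):
    """Standalone sketch mirroring dinov2's standard Attention layout,
    but returning the attention matrix alongside the new representation."""

    def __init__(self, dim: int, num_heads: int = 12) -> None:
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        B, N, C = x.shape
        qkv = (
            self.qkv(x)
            .reshape(B, N, 3, self.num_heads, C // self.num_heads)
            .permute(2, 0, 3, 1, 4)
        )
        q, k, v = qkv[0] * self.scale, qkv[1], qkv[2]

        # Explicit attention matrix of shape (B, num_heads, N, N). This is the
        # tensor the notebook slices; xFormers' memory_efficient_attention
        # never materializes it, which is why it must be computed like this.
        attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)

        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(x), attn


# Usage sketch: swap (or monkey-patch) the last block's attention for a module
# like this, run a forward pass, and keep the returned attention matrix.
block = AttentionWithWeights(dim=768, num_heads=12)
tokens = torch.rand(1, 21, 768)   # e.g. 1 [CLS] + 4 registers + 16 patches
out, attn = block(tokens)
print(out.shape, attn.shape)      # torch.Size([1, 21, 768]) torch.Size([1, 12, 21, 21])
```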