UNI
How was figure 3e generated in the paper?
I used something similar to this to extract the attention scores for the penultimate layer, as explained in the caption for figure 3e. However, the attention maps I'm getting are much less "intuitive" than the ones shown in the figure.
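To make clear what I mean by attention scores, here is a minimal sketch of the computation, not necessarily what the paper did. It assumes I already have the query and key tensors of one attention block for a single image, that token 0 is the CLS token, and that there are no register tokens. `cls_attention_map` is a name I made up for illustration.

```python
import numpy as np

def cls_attention_map(q, k, grid_size):
    """Head-averaged CLS-to-patch attention for one image.

    q, k: arrays of shape (heads, tokens, head_dim); token 0 is assumed to be CLS.
    Returns an array of shape (grid_size, grid_size).
    """
    head_dim = q.shape[-1]
    # Scaled dot product of the CLS query with every key: (heads, 1, tokens)
    scores = q[:, :1, :] @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    # Numerically stable softmax over the token axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    # Drop the CLS-to-CLS entry, average over heads, reshape to the patch grid
    patch_attn = attn[:, 0, 1:]
    return patch_attn.mean(axis=0).reshape(grid_size, grid_size)
```

For a 224^2 input with 16-pixel patches (my assumption for the ViT patch size), `grid_size` would be 14, so the result is a 14x14 map that still has to be upsampled to the image size for display.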
Was this figure generated with a UNI model fine-tuned on the ROI-level task, or is it just showing the attention maps of the SSL model (no fine-tuning)?
Also, are the 448^2, 896^2 and 1344^2 attention maps computed by concatenating the attention map for each non-overlapping 224^2 patch together?
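To make the question concrete, this is a minimal sketch of the stitching I have in mind, assuming each 224^2 tile has already been reduced to a square attention map and the tiles are listed in row-major order. `stitch_tile_maps` is a hypothetical helper of mine, not something from the UNI code.

```python
import numpy as np

def stitch_tile_maps(tile_maps, tiles_per_side):
    """Concatenate per-tile attention maps into one large map.

    tile_maps: list of (g, g) arrays in row-major order (left to right, top to bottom).
    tiles_per_side: number of tiles along each side, e.g. 2 for a 448^2 image.
    Returns an array of shape (tiles_per_side * g, tiles_per_side * g).
    """
    n = tiles_per_side
    if len(tile_maps) != n * n:
        raise ValueError("expected tiles_per_side**2 tile maps")
    # Join each row of tiles horizontally, then stack the rows vertically
    rows = [np.concatenate(tile_maps[r * n:(r + 1) * n], axis=1) for r in range(n)]
    return np.concatenate(rows, axis=0)
```

With this approach each tile's map is normalized independently (its softmax is computed only over that tile's tokens), so the values are not directly comparable across tile boundaries, which is part of why I'm asking whether the figure was produced this way.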