efficient-attention
How to replicate attention maps in object detection
Can you share the code on how to visualize attention maps in object detection like the one shown in your paper?
Hi Chandler,

The visualization code was inside the code base of my company at that time. Because it was not part of this open-source project, I believe they will not release it. (I also no longer have access to it since I left the company.)

The logic is very simple, though. We were visualizing each channel in `keys`. For `keys` of shape `[n, d_k, h, w]`, we slice it into `n * d_k` tensors, each of shape `[1, 1, h, w]`. Since we were visualizing the softmax variant, each element is in the range `(0, 1)`, which is easy to paint as a greyscale image.
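The slicing described above can be sketched as follows. This is a minimal NumPy illustration, not the original (unreleased) company code; the function name `visualize_key_channels` and the random input are made up for demonstration.

```python
import numpy as np

def visualize_key_channels(keys):
    """Slice a softmax-normalized `keys` tensor of shape [n, d_k, h, w]
    into n * d_k single-channel maps and scale each to an 8-bit
    greyscale image. (Illustrative sketch only.)"""
    n, d_k, h, w = keys.shape
    images = []
    for i in range(n):
        for j in range(d_k):
            channel = keys[i, j]  # shape [h, w], values in (0, 1)
            images.append((channel * 255).astype(np.uint8))
    return images

# Softmax over the spatial positions of each channel puts every
# element in (0, 1), as in the paper's softmax variant.
rng = np.random.default_rng(0)
raw = rng.standard_normal((2, 4, 8, 8))          # [n, d_k, h, w]
flat = raw.reshape(2, 4, -1)
keys = (np.exp(flat) / np.exp(flat).sum(-1, keepdims=True)).reshape(2, 4, 8, 8)
maps = visualize_key_channels(keys)               # 2 * 4 = 8 greyscale maps
```

Each resulting array can then be saved directly as a greyscale image.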
@cmsflash In the image above (Figure 3 in the paper), the description says it is a visualization of attention maps from the efficient attention module. Yet, you mentioned here that the visualization is done only at the `keys`.

I thought you visualized the attention maps from the output of the module.
The description says the figure is visualizing the "global attention maps" from the efficient attention module. The "global attention maps" are the individual channels in `keys`.
> visualization is done only at the `keys`
@cmsflash I'm confused here. Can't the global attention only be extracted when we use `softmax(QK.T/sqrt(dk))V` (or variations of it)?

If only the channels of the `keys` are visualized, then it is just spatial information from the input image; no attention has been extracted yet.
The attention maps generated from `QK` are the pixel-wise attention maps. In our terminology, the "global attention maps" are the individual channels in `K`. Please check Section 3.4 in the paper for the reasoning behind the terminology.
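The distinction between the two kinds of maps can be sketched as follows. This is a minimal NumPy illustration with made-up shapes; the names `pixelwise` and `global_maps` are assumptions for demonstration, and the softmax normalization of the keys follows the softmax variant discussed above.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical shapes: 16 spatial positions, d_k = 4 key channels
rng = np.random.default_rng(0)
q = rng.standard_normal((16, 4))  # queries, one row per pixel
k = rng.standard_normal((16, 4))  # keys, one row per pixel

# Pixel-wise attention maps: softmax over the rows of QK^T,
# i.e. one attention map per query pixel.
pixelwise = softmax(q @ k.T / np.sqrt(4), axis=1)  # shape [16, 16]

# "Global attention maps": softmax each key channel over the spatial
# positions, i.e. one map per channel, independent of any query pixel.
global_maps = softmax(k, axis=0).T                 # shape [4, 16]
```

Here there are only `d_k` global attention maps regardless of image size, whereas the pixel-wise formulation produces one map per pixel.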