ImageCaptioning.pytorch
Generate soft attention pictures for each word
As the paper mentions, "As the model generates each word, its attention changes to reflect the relevant parts of the image." I'd like to generate the soft attention picture for each word, but I've run into some problems. Can the `eval.py` script do this, or how should I implement it?
Best regards.
Hello, if you look into the model's code you'll find an `Attention` class that computes the attention scores. It may be used slightly differently in different models, though. So the easiest way to do this is with forward hooks. Here is an example for the top-down model.
```python
# Initialize the model as usual, then register a forward hook on alpha_net,
# the layer inside the Attention module that produces the attention scores.
model = SomeModelAsInitializedUsually(**kwargs)

additive_attentions = []

def hook_att_map(module, input, output):
    # Runs on every forward pass of alpha_net; stash the scores for later.
    additive_attentions.append(output.cpu().data)

handler_attn = model.core.attention.alpha_net.register_forward_hook(hook_att_map)

# <run model on some input>
```
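Once generation has run (the exact sampling call varies across versions of this repo, so it's left as a placeholder above), you can detach the hook via its handle; a short sketch:

```python
# Detach the hook once the caption has been generated, then inspect the
# collected maps; the shapes assume the top-down model hooked as above.
handler_attn.remove()
print(len(additive_attentions))      # one entry per generated word
print(additive_attentions[0].shape)  # scores over the 14*14 feature grid
```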
After this, `additive_attentions` will store an attention map for each generated word, so if the caption has 10 words there will be 10 attention maps. Note that they are 14×14 because that is the spatial size of the feature grid the model attends over; you can change this value in the `resnet_utils.MyResnet` class, but re-training is required afterwards. You can then stretch them to the original image size to produce pictures like the ones in the paper.
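For that last step, here is a minimal, hedged sketch of the upsampling and overlay; `attention_overlay` is a hypothetical helper, the 14×14 grid matches the note above, and whether a softmax is still needed depends on where exactly `alpha_net` sits in your version (in some versions its output is pre-softmax):

```python
import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image

# Hypothetical helper: overlay one word's attention map on the original image.
def attention_overlay(att_scores, image_path, out_path):
    # Normalize the scores over the grid (the softmax here assumes the hooked
    # layer returns pre-softmax scores; drop it if yours are already normalized).
    att = torch.softmax(att_scores.view(-1).float(), dim=0).view(1, 1, 14, 14)
    img = Image.open(image_path).convert('RGB')
    # Bilinearly "stretch" the 14x14 map up to the image resolution.
    att_up = F.interpolate(att, size=(img.height, img.width),
                           mode='bilinear', align_corners=False)
    att_np = att_up.squeeze().numpy()
    att_np = (att_np - att_np.min()) / (att_np.max() - att_np.min() + 1e-8)
    # Brighten attended regions, darken everything else, and save.
    blended = (np.asarray(img, dtype=np.float32) * att_np[..., None]).astype(np.uint8)
    Image.fromarray(blended).save(out_path)

# e.g. one overlay per generated word:
# for i, att in enumerate(additive_attentions):
#     attention_overlay(att[0], 'your_image.jpg', f'att_word_{i}.png')
```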