ImageCaptioning.pytorch

Generate soft attention pictures of each word

Open Silence1995 opened this issue 6 years ago • 1 comments

As the paper mentions, "As the model generates each word, its attention changes to reflect the relevant parts of the image." I'd like to generate the soft attention picture for each word, but I ran into some problems. Can the eval.py script do this? Or how should I implement it?

Best regards.

Silence1995 avatar Apr 30 '18 11:04 Silence1995

Hello, if you look into the model's code you'll find an Attention class that calculates attention scores. It may be used slightly differently in different models, though, so the easiest way to capture the maps is to use forward hooks. Here is an example for the top-down model.

model = SomeModelAsInitializedUsually(**kwargs)
additive_attentions = []

def hook_att_map(module, inputs, output):
    # Called after every forward pass of alpha_net; stash the attention
    # weights on the CPU so they survive the generation loop.
    additive_attentions.append(output.cpu().data)

handler_attn = model.core.attention.alpha_net.register_forward_hook(hook_att_map)
# <run model on some input>
# handler_attn.remove()  # detach the hook when you are done
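Forward hooks are a standard PyTorch mechanism, so the capture pattern can be checked on any small module before wiring it into the captioning model. A minimal sketch (the module and names here are toy stand-ins, not part of the repo):

```python
import torch
import torch.nn as nn

captured = []

def hook(module, inputs, output):
    # Same shape of callback as hook_att_map above: record every output.
    captured.append(output.detach().cpu())

# Any submodule works; in the captioning model it would be alpha_net.
layer = nn.Linear(4, 2)
handle = layer.register_forward_hook(hook)

layer(torch.randn(3, 4))   # one forward pass -> one captured tensor
handle.remove()            # stop capturing once done
```

Each forward call appends one tensor, which is why a 10-word caption leaves 10 attention maps in the list.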

After this, additive_attentions will store one attention map per generated word, so a 10-word caption yields 10 attention maps. Note that they are 14×14 because of the feature-grid size; you can change this value in the resnet_utils.MyResnet class, but re-training will be required afterwards. You can then stretch the maps to the original image size to produce pictures like those in the paper.
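As a rough sketch of the stretching step (the function name, the 16× scale, and the flat length-196 input are my assumptions, not code from the repo), a nearest-neighbour upsample in plain NumPy:

```python
import numpy as np

def upsample_attention(alpha_flat, scale=16, grid=14):
    """Stretch a flat 14x14 attention map to (grid*scale, grid*scale).

    alpha_flat: length grid*grid vector, e.g. one entry of
    additive_attentions converted with .numpy().ravel().
    """
    alpha = np.asarray(alpha_flat, dtype=np.float64).reshape(grid, grid)
    alpha = alpha / (alpha.max() + 1e-8)   # normalize to [0, 1] for display
    # Nearest-neighbour upsampling: repeat each cell in a scale x scale block.
    return np.kron(alpha, np.ones((scale, scale)))
```

With scale=16 this gives a 224×224 map that can be overlaid on the (resized) input image, e.g. via matplotlib's imshow with a reduced alpha, to reproduce the visualizations from the paper; a smoother result would use bilinear interpolation instead.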

mojesty avatar Jun 24 '18 07:06 mojesty