self-critical.pytorch
To use m2transformer
The m2transformer in this repo will get worse results than the transformer implemented in this repo, and also worse than what's in the m2transformer paper.
In short, the difference is caused by the framework, not the model architecture. That is to say, if you port my transformer into their codebase, you will also get better results than m2transformer. If you are curious, please check out https://github.com/ruotianluo/meshed-memory-transformer/tree/mytransformer and look at the description of the commit.
I have been trying to isolate which part of their codebase leads to the higher performance. It has been narrowed down to the dataloader and the CIDEr computation they use (so not the learning rate, not beam-search self-critical, not gradient clipping, etc.). I have been able to port their dataloader and CIDEr computation into my codebase, and I can fairly say I am able to reproduce the results given by m2transformer.
I will keep digging deeper to see what specifically is changing the game. Will update here.
I still appreciate their codebase. The design is much more modern than mine.
Hi @ruotianluo. Thanks for your investigation. In December last year I implemented the meshed decoder part on your codebase by changing the DecoderLayer to:
import numpy as np
import torch
import torch.nn as nn
# clones and SublayerConnection are the helpers already defined in this repo's TransformerModel module

class DecoderLayer(nn.Module):
    "Meshed decoder layer: self-attention, per-encoder-layer cross-attention, gated fusion, feed forward."
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)
        # gates that weight the contribution of each encoder layer
        self.fc_alpha1 = nn.Linear(size * 2, size)
        self.fc_alpha2 = nn.Linear(size * 2, size)
        self.fc_alpha3 = nn.Linear(size * 2, size)

    def forward(self, x, memory, src_mask, tgt_mask):
        # memory of shape: (batch_size, num_layers, num_boxes, d_model),
        # i.e. the outputs of all 3 encoder layers stacked along dim 1
        query = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        # cross-attend to each encoder layer's output separately
        enc_att1 = self.src_attn(query, memory[:, 0], memory[:, 0], src_mask)
        enc_att2 = self.src_attn(query, memory[:, 1], memory[:, 1], src_mask)
        enc_att3 = self.src_attn(query, memory[:, 2], memory[:, 2], src_mask)
        # sigmoid gates computed from the query and each cross-attention output
        alpha1 = torch.sigmoid(self.fc_alpha1(torch.cat([query, enc_att1], -1)))
        alpha2 = torch.sigmoid(self.fc_alpha2(torch.cat([query, enc_att2], -1)))
        alpha3 = torch.sigmoid(self.fc_alpha3(torch.cat([query, enc_att3], -1)))
        sum_src_attn = (enc_att1 * alpha1 + enc_att2 * alpha2 + enc_att3 * alpha3) / np.sqrt(3)
        # residual connection + layer norm around the fused cross-attention output (identity sublayer)
        out_sum = self.sublayer[1](sum_src_attn, lambda t: t)
        return self.sublayer[2](out_sum, self.feed_forward)
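For anyone wanting to try this, here is a minimal smoke-test sketch (mine, not from the thread) showing the memory layout the layer above expects: the outputs of the three encoder layers stacked along dim 1. It assumes MultiHeadedAttention, PositionwiseFeedForward and subsequent_mask from this repo's TransformerModel module; the import path and all shapes are placeholders.

import torch
# assumed to come from this repo's TransformerModel module; the exact path may differ by version
from captioning.models.TransformerModel import (
    MultiHeadedAttention, PositionwiseFeedForward, subsequent_mask)

d_model, h, d_ff, num_boxes, seq_len, batch = 512, 8, 2048, 36, 20, 2

layer = DecoderLayer(d_model,
                     MultiHeadedAttention(h, d_model),
                     MultiHeadedAttention(h, d_model),
                     PositionwiseFeedForward(d_model, d_ff),
                     dropout=0.1)

x = torch.randn(batch, seq_len, d_model)  # decoder input (embedded words + positional encoding)
# stack the outputs of the 3 encoder layers along dim 1 -> (batch, 3, num_boxes, d_model)
memory = torch.stack([torch.randn(batch, num_boxes, d_model) for _ in range(3)], dim=1)
src_mask = torch.ones(batch, 1, num_boxes, dtype=torch.bool)  # all boxes valid
tgt_mask = subsequent_mask(seq_len)                           # causal mask over the target words

out = layer(x, memory, src_mask, tgt_mask)                    # (batch, seq_len, d_model)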
The original transformer with 3 layers can actually get better scores than with 6 layers (it achieves 1.122 with XE training on your codebase, higher than the 1.113 reported). But the meshed decoder with 3 layers achieves less than 1.122. So even if your transformer is ported, the results are still not good.
Thanks. The performance I mentioned is after self-critical training. My finding is that 3 layers are not as good as 6 layers. (I only looked at greedy decoding results on the val set.)
I downloaded the m2transformer pretrained model and tested it on the Karpathy test set. Nearly 1/3 (1718/5000) of the sentences have bad endings like
'a woman eating a cupcake with lit candles on a',
'a street sign in front of a building with a'.
I tested this model using their code and got
{'BLEU': [0.8076084272899184,
0.65337618312199,
0.5093125587687117,
0.3909357911782391],
'METEOR': 0.2918900660095916,
'ROUGE': 0.5863539878042495,
'CIDEr': 1.3119740267338893}
After removing these bad endings, I get
{'BLEU': [0.8053965764171623,
0.6566697409874372,
0.516585609264117,
0.39901320646984684],
'METEOR': 0.2896812927767685,
'ROUGE': 0.5889955346514036,
'CIDEr': 1.290116122320751}
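For reproducibility, here is a small, hypothetical sketch of how such truncated captions could be flagged and stripped before recomputing the metrics; the stop-word list and the filtering rule are my assumptions, not the exact procedure used above.

# Hedged sketch: flag captions that end on an article/preposition/conjunction,
# i.e. the kind of truncated ending quoted above. The word list is an assumption.
BAD_LAST_WORDS = {'a', 'an', 'the', 'with', 'of', 'on', 'in', 'at', 'to', 'and'}

def has_bad_ending(caption):
    words = caption.strip().rstrip('.').split()
    return bool(words) and words[-1] in BAD_LAST_WORDS

def strip_bad_ending(caption):
    words = caption.strip().rstrip('.').split()
    while words and words[-1] in BAD_LAST_WORDS:
        words.pop()
    return ' '.join(words)

captions = ['a woman eating a cupcake with lit candles on a',
            'a street sign in front of a building with a']
print(sum(has_bad_ending(c) for c in captions))   # 2
print(strip_bad_ending(captions[0]))              # a woman eating a cupcake with lit candles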
The way they compute CIDEr seems to encourage captions with bad endings. I am not sure whether it is because CIDEr is being 'gamed' by the RL algorithm or something else.
I didn't realize it was that high. The reason for the bad endings is that they didn't add the eos token while doing self-critical training, so the CIDEr score doesn't punish these bad endings.
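To make that concrete, here is a minimal, hypothetical sketch (not code from either repo) of what "adding the eos token" means for the strings that CIDEr scores during self-critical training; using index 0 for eos is just a placeholder convention.

# Keep the eos token (index 0 here) in the string that CIDEr sees. A sample that
# ends properly then shares its final n-grams (..., word, 0) with the references,
# while a truncated sample does not, so CIDEr now penalizes bad endings.
def seq_to_str_keep_eos(seq, eos_idx=0):
    out = []
    for ix in seq:
        out.append(str(ix))
        if ix == eos_idx:          # include eos itself, then stop
            break
    return ' '.join(out)

reference = [9, 4, 21, 7, 33, 0]   # reference caption, ends with eos
good      = [9, 4, 21, 7, 33, 0]   # sample that produced eos
truncated = [9, 4, 21, 7]          # sample that never produced eos
print(seq_to_str_keep_eos(reference))  # '9 4 21 7 33 0'
print(seq_to_str_keep_eos(good))       # '9 4 21 7 33 0'
print(seq_to_str_keep_eos(truncated))  # '9 4 21 7'  -> misses the eos n-grams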
Yes, I have tried adding eos and trained a model. There are no such sentences anymore, but CIDEr also decreases. So is that the only difference between the two frameworks?
Yes, it will decrease. There are other differences, including the vocab, the way they compute CIDEr (they compute it on raw text), and the way they compute the tf-idf in CIDEr.
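As an illustration of the raw-text point (using the standalone pycocoevalcap package, not the training-time CIDEr code of either repo, and a toy corpus whose absolute scores mean little), raw strings and PTB-tokenized strings produce different n-grams and therefore different CIDEr values:

# Requires pycocoevalcap (and Java for the PTB tokenizer). Toy data for illustration only.
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.cider.cider import Cider

gts = {0: [{'caption': 'A man is riding a horse.'}],
       1: [{'caption': 'A plate of food on a table.'}]}
res = {0: [{'caption': 'a man riding a'}],
       1: [{'caption': 'a plate of food on a table'}]}

# CIDEr on raw strings: capitalization and punctuation stay in the n-grams
raw_gts = {k: [c['caption'] for c in v] for k, v in gts.items()}
raw_res = {k: [c['caption'] for c in v] for k, v in res.items()}
print(Cider().compute_score(raw_gts, raw_res)[0])

# CIDEr after PTB tokenization: lowercased, punctuation stripped
tok = PTBTokenizer()
print(Cider().compute_score(tok.tokenize(gts), tok.tokenize(res))[0])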
Hi, I've been working on diversity evaluation across different frameworks, and I wonder if it is possible to evaluate the m2 model with multiple samples? I've checked the repository branch you provided and haven't found a multiple-sampling implementation in it. I guess it is not as simple as swapping the model file into the self-critical branch, given the vocab and other framework differences. If there is an easy way to integrate them, please let me know. Thanks a lot!
I think you can run the m2 transformer in my codebase. It won't perform as well, because there is a bug in m2transformer.