
To use m2transformer

ruotianluo opened this issue 5 years ago • 8 comments

The m2transformer in this repo gets worse results than the transformer implemented in this repo, and also worse than what is reported in the m2transformer paper.

In short, the difference is caused by the framework, not the model architecture. That is to say, if you port my transformer into their codebase, you can also get better results than m2transformer. If you are curious, please check out https://github.com/ruotianluo/meshed-memory-transformer/tree/mytransformer and look at the description of the commit.

I have been trying to rule out which part of their codebase leads to the higher performance. It has been narrowed down to the dataloader and the cider computation they use (so it is not the learning rate, not beam-search self-critical, not gradient clipping, etc.). I have been able to port their dataloader and cider computation into my codebase, and I can fairly say I am able to reproduce the results given by m2transformer.

I will keep digging deeper to see what specifically is changing the game. Will update here.

I still appreciate their codebase. The design is much more modern than mine.

ruotianluo avatar Apr 19 '20 23:04 ruotianluo

Hi @ruotianluo. Thanks for your investigation. In December last year I implemented the meshed decoder part on your codebase by changing the DecoderLayer to:

import numpy as np
import torch
import torch.nn as nn

# clones() and SublayerConnection come from the codebase's TransformerModel.py
class DecoderLayer(nn.Module):
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)
        # size is d_model here (the original snippet referenced an undefined d_model)
        self.fc_alpha1 = nn.Linear(size * 2, size)
        self.fc_alpha2 = nn.Linear(size * 2, size)
        self.fc_alpha3 = nn.Linear(size * 2, size)

    def forward(self, x, memory, src_mask, tgt_mask):
        # memory of shape: (batch_size, num_layers, num_boxes, d_model)
        query = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        # cross-attention against each encoder layer's output
        enc_att1 = self.src_attn(query, memory[:, 0], memory[:, 0], src_mask)
        enc_att2 = self.src_attn(query, memory[:, 1], memory[:, 1], src_mask)
        enc_att3 = self.src_attn(query, memory[:, 2], memory[:, 2], src_mask)
        # per-layer sigmoid gates, as in the meshed decoder
        alpha1 = torch.sigmoid(self.fc_alpha1(torch.cat([query, enc_att1], -1)))
        alpha2 = torch.sigmoid(self.fc_alpha2(torch.cat([query, enc_att2], -1)))
        alpha3 = torch.sigmoid(self.fc_alpha3(torch.cat([query, enc_att3], -1)))
        sum_src_attn = (enc_att1 * alpha1 + enc_att2 * alpha2 + enc_att3 * alpha3) / np.sqrt(3)
        # identity sublayer: the original snippet passed None here, which the
        # stock SublayerConnection does not accept
        out_sum = self.sublayer[1](sum_src_attn, lambda x: x)
        return self.sublayer[2](out_sum, self.feed_forward)
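
For context, here is a hedged sketch of how the memory tensor consumed above could be built: the meshed decoder attends to every encoder layer's output, so the per-layer outputs are stacked along a new dimension. The function name and shapes are illustrative assumptions, not the repo's exact API.

import torch

def encode_with_all_layers(encoder_layers, x, src_mask):
    # x: (batch_size, num_boxes, d_model) embedded region features (assumed layout)
    # returns memory of shape (batch_size, num_layers, num_boxes, d_model)
    outs = []
    for layer in encoder_layers:
        x = layer(x, src_mask)
        outs.append(x)
    return torch.stack(outs, dim=1)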

The original transformer with 3 layers can actually get better scores than with 6 layers (it achieves 1.122 CIDEr under cross-entropy (XE) training on your codebase, higher than the 1.113 reported). But the meshed decoder with 3 layers achieves worse than 1.122. So even when your transformer is ported, the meshed results are still not good.

fawazsammani avatar Apr 20 '20 05:04 fawazsammani

Thanks. The performance I mentioned is after using self-critical training. My finding is that 3 layers is not as good as 6 layers. (I only looked at the greedy decoding result on the val set.)

ruotianluo avatar Apr 20 '20 05:04 ruotianluo

I downloaded the m2transformer pretrained model and tested it on the Karpathy test split. Nearly 1/3 (1718/5000) of the sentences have bad endings, like

'a woman eating a cupcake with lit candles on a',
'a street sign in front of a building with a'.

I tested this model using their code and got

{'BLEU': [0.8076084272899184,
  0.65337618312199,
  0.5093125587687117,
  0.3909357911782391],
 'METEOR': 0.2918900660095916,
 'ROUGE': 0.5863539878042495,
 'CIDEr': 1.3119740267338893}

After removing these bad endings, I got

{'BLEU': [0.8053965764171623,
  0.6566697409874372,
  0.516585609264117,
  0.39901320646984684],
 'METEOR': 0.2896812927767685,
 'ROUGE': 0.5889955346514036,
 'CIDEr': 1.290116122320751}

The way they compute CIDEr seems to encourage captions with bad endings. I am not sure whether it is because CIDEr is 'gamed' by the RL algorithm or because of something else.
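
For reference, a toy sketch (not from either codebase) of the kind of check used above to count captions with bad endings; the stop-word list is an illustrative assumption.

BAD_END_WORDS = {'a', 'an', 'the', 'with', 'of', 'on', 'in', 'and'}

def count_bad_endings(captions):
    # count captions that stop on a dangling function word
    bad = [c for c in captions if c.split() and c.split()[-1] in BAD_END_WORDS]
    return len(bad), bad

n_bad, examples = count_bad_endings([
    'a woman eating a cupcake with lit candles on a',
    'a street sign in front of a building with a',
    'a man riding a wave on top of a surfboard',
])
print(n_bad, 'of 3 captions have bad endings')  # -> 2 of 3 captions have bad endings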

luo3300612 avatar Apr 29 '20 07:04 luo3300612

I didn't realize it was that high. The reason for the bad endings is that they don't add the eos token while doing self-critical training, and the CIDEr score doesn't punish these bad endings.
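
As a rough illustration (not the actual reward code in either repo) of what "adding the eos token" means before the sampled captions are scored for the self-critical reward; the helper name, the pad/eos conventions, and the array layout are assumptions.

import numpy as np

def terminate_with_eos(sample_ids, eos_idx, pad_idx=0):
    # sample_ids: (batch, max_len) array of sampled token ids (hypothetical layout).
    # Ensure every sequence scored for the CIDEr reward ends with an explicit <eos>;
    # without it, a caption cut off mid-phrase ('... with a') is scored as-is and
    # CIDEr does not penalize the dangling ending.
    out = np.array(sample_ids).copy()
    for i, row in enumerate(out):
        pads = np.nonzero(row == pad_idx)[0]
        end = pads[0] if len(pads) else len(row)
        if end == 0 or row[end - 1] != eos_idx:
            if end < len(row):
                out[i, end] = eos_idx   # write <eos> into the first padding slot
            else:
                out[i, -1] = eos_idx    # no padding left: overwrite the final token
    return out

With the eos kept, n-grams like "on a <eos>" do not match the reference n-grams, so the reward itself pushes against truncated endings.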

ruotianluo avatar Apr 29 '20 13:04 ruotianluo

Yes, I tried adding eos and trained a model. There are no such sentences anymore, but CIDEr also decreases. So is this the only difference between the two frameworks?

luo3300612 avatar Apr 29 '20 14:04 luo3300612

Yes, it will decrease. There are other differences, including the vocab, the way CIDEr is computed (they compute it on raw text), and the way they compute the tf-idf in CIDEr.
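
A hedged sketch of the "raw text" point: the standard coco-caption pipeline runs PTB tokenization before scoring, while scoring raw (only lower-cased) strings changes the n-grams CIDEr sees. The captions and ids below are made up for illustration, and pycocoevalcap is assumed to be installed (its tokenizer also needs Java).

from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.cider.cider import Cider

gts_raw = {'img1': [{'caption': 'A woman is eating a cupcake with lit candles.'}],
           'img2': [{'caption': 'A street sign in front of a tall building.'}]}
res_raw = {'img1': [{'caption': 'a woman eating a cupcake with lit candles on a'}],
           'img2': [{'caption': 'a street sign in front of a building with a'}]}

# Variant 1: score raw strings directly (only lower-cased, punctuation kept)
gts = {k: [c['caption'].lower() for c in v] for k, v in gts_raw.items()}
res = {k: [c['caption'].lower() for c in v] for k, v in res_raw.items()}
score_raw, _ = Cider().compute_score(gts, res)

# Variant 2: PTB-tokenize first, as coco-caption does
tok = PTBTokenizer()
score_tok, _ = Cider().compute_score(tok.tokenize(gts_raw), tok.tokenize(res_raw))

print(score_raw, score_tok)  # the two conventions generally give different numbers

The tf-idf statistics are another knob: the idf can be computed from the evaluation references themselves or loaded from a fixed training-corpus document-frequency file, and the two choices give different CIDEr values.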

ruotianluo avatar Apr 29 '20 14:04 ruotianluo

Hi, I've been working on diversity evaluation across different frameworks, and I wonder if it is possible to evaluate the m2 model with multiple samples. I've checked the repository branch you provided, but haven't found a multiple-sampling implementation in it. I guess it is not that simple to just change the model file in the self-critical branch to solve this problem, given the vocab and other framework changes. If there is an easy way to integrate them, please let me know. Thanks a lot!

kaelsunkiller avatar Apr 29 '22 06:04 kaelsunkiller

I think you can run the m2 transformer in my codebase. It won't behave as well, because there is a bug in m2transformer.

ruotianluo avatar Apr 29 '22 07:04 ruotianluo