Target-Guided-Conversation

Some questions regarding evaluations on next keyword prediction

zhongpeixiang opened this issue on May 18 '20 · 10 comments

Hi,

Thank you very much for sharing your work!

I have a few questions regarding the evaluation of keyword prediction. Apologies if I have missed or misunderstood something in your code; I'm not familiar with TensorFlow.

  1. For a given history of keywords, there can be multiple target keywords for the next turn. Do you minimize the negative log-likelihood losses for every target keyword? Is the batch loss averaged over batch size or the number of target keywords in the batch?

  2. How did you compute the correlation metric? Greedy, average, or max embedding matching? Do you compute the correlation between only the top-1 predicted keyword and the target keywords, or between the top-k predicted keywords and the targets? Do you average across target keywords before or after computing correlations?

Any response will be appreciated.

Thanks, Peixiang

zhongpeixiang commented on May 18 '20

Thanks for your attention,

  1. In this repository, we treat next-keyword prediction as a binary classification over each candidate keyword and minimize the cross-entropy loss over both positive and negative labels. The loss is averaged over all keywords:
        # Turn each example's keyword ids into a multi-hot label vector over the
        # keyword vocab, then drop the first 4 entries (the special vocab tokens).
        kw_labels = tf.map_fn(lambda x: tf.sparse_to_dense(x, [self.kw_vocab.size], 1., 0., False),
                              keywords_ids, dtype=tf.float32, parallel_iterations=True)[:, 4:]
        # Independent sigmoid cross-entropy per candidate keyword, averaged over
        # the batch and the keyword vocab.
        loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=kw_labels, logits=matching_score)
        loss = tf.reduce_mean(loss)

You could also minimize the negative log-likelihood loss of every target keyword after a softmax layer; in my experience the training results are similar.

  2. The correlation metric is computed as the maximum cosine similarity, over word-embedding pairs, between the top-k predicted keywords and all words in the target response.
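A minimal sketch of that computation (illustrative only, not the exact code in this repo; `emb` is assumed to be a word-to-vector mapping such as the GloVe table):

import numpy as np

def correlation(pred_keywords, response_words, emb):
    # Max cosine similarity over all (predicted keyword, response word) pairs;
    # words missing from the embedding table are skipped.
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))
    scores = [cos(emb[k], emb[w]) for k in pred_keywords for w in response_words
              if k in emb and w in emb]
    return max(scores) if scores else 0.0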

squareRoot3 commented on May 19 '20

@squareRoot3 Thank you very much for the quick reply. I have two more questions regarding keyword prediction.

Q1

It seems that the test keyword vocabulary is used as the keyword vocab during training. Is there a reason for this?

./config/data_config.py:

_keywords_path = 'tx_data/test/keywords_vocab.txt'

./model/neural.py:

self.kw_vocab = tx.data.Vocab(self.data_config._keywords_path)

Q2

I experimented with both the binary CE loss over every candidate keyword and the negative log-likelihood loss over every target keyword, and found that the former gives an R@1 of 0.015 while the latter gives an R@1 of 0.065. Why is the former loss not comparable with your results?

Here is the PyTorch code to compute the two losses:

import torch
import torch.nn.functional as F

def compute_BCE(logits, target):
    """
        logits: (batch, vocab_size)
        target: (batch, seq_len), we set seq_len=10 such that each utterance has a max of 10 target keywords, the rest are padded with 0
    """
    target_new = torch.zeros_like(logits) # (batch, vocab_size)
    target_new = target_new.scatter(1, target, 1.0)
    target_new[:,0] = 0 # zero out the pad-token column (index 0)
    loss = F.binary_cross_entropy_with_logits(logits, target_new)
    return loss

def compute_NLLLoss(logits, target):
    """
        logits: (batch, vocab_size)
        target: (batch, seq_len)
    """
    target_mask = target.ne(0).float() # (batch, seq_len), mask out paddings
    logits = F.log_softmax(logits, dim=-1)
    loss = -1 * (torch.gather(logits, dim=1, index=target) * target_mask).sum() # negative log-likelihood loss
    loss = loss/target_mask.sum()
    return loss
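For reference, a quick check of both functions on dummy tensors (the vocab size and target ids below are placeholders):

torch.manual_seed(0)
logits = torch.randn(2, 100)                      # (batch=2, vocab_size=100)
target = torch.tensor([[5, 9] + [0] * 8,          # two target keywords per example,
                       [3, 7] + [0] * 8])         # padded with 0 up to seq_len=10
print(compute_BCE(logits, target).item())
print(compute_NLLLoss(logits, target).item())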

zhongpeixiang commented on May 20 '20

Q1: I had thought that the test keyword vocab contains more frequent keywords and is relatively smaller, which can facilitate training. But using the train keyword vocab seems more reasonable. We have fixed this in the new repository: https://github.com/James-Yip/TGODC-DKRN.

Q2: It looks like the implementations of the two losses are correct, so I'm sorry that I have no idea what causes the difference. The BCE loss in our repository works normally.

squareRoot3 commented on May 20 '20

Sorry to bother you again. Another strange thing happened with the retrieval-neural model.

I trained a keyword prediction model and obtained around 0.08 test R@1.

I also trained a retrieval baseline (without keyword conditioning) and obtained around 0.51 test R@1.

However, when I train the retrieval-neural model to use the predicted keywords to retrieve the next turn, the result is still around 0.51. It seems that using keywords does not improve model performance.

My implementation of conditioning on keywords follows your code; the full SMN class is shown after the steps below:

  1. Predict the top 3 keywords for the next turn, based on the keyword history and the pretrained keyword predictor.
  2. Average the 3 keyword embeddings.
  3. Apply a linear transformation to get K.
  4. Encode the contextual utterances to get C.
  5. Concatenate K with the contextual utterance representation C to get [C;K].
  6. Encode the candidate responses to get R.
  7. Use a separate GRU encoder to encode the candidates for comparison with keywords, getting R_kw.
  8. Concatenate the two candidate representations to get [R;R_kw].
  9. Apply elementwise multiplication between [C;K] and [R;R_kw], followed by a linear transformation.

class SMN(nn.Module):
    def __init__(self, embed_size, vocab_size, hidden_size, n_layers, bidirectional, dropout=0):
        super(SMN, self).__init__()
        self.embed_size = embed_size
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        self.bidirectional = bidirectional
        self.dropout = dropout
        self.embedding = nn.Embedding(vocab_size, embed_size)

        self.utterance_encoder = nn.GRU(embed_size, hidden_size, n_layers, batch_first=True, dropout=dropout, bidirectional=True)
        self.context_encoder = nn.GRU(2*hidden_size, 2*hidden_size, n_layers, batch_first=True, dropout=dropout, bidirectional=False)
        self.candidate_encoder = nn.GRU(embed_size, hidden_size, n_layers, batch_first=True, dropout=dropout, bidirectional=True)
        self.candidate_kw_encoder = nn.GRU(embed_size, hidden_size, n_layers, batch_first=True, dropout=dropout, bidirectional=True)
        self.kw_mlp = nn.Linear(embed_size, 2*hidden_size)
        self.match_MLP_kw = nn.Linear(4*hidden_size, 1)
        self.match_MLP = nn.Linear(2*hidden_size, 1)
    
    def init_embedding(self, embedding, fix_word_embedding):
        self.embedding.weight.data.copy_(embedding)
        if fix_word_embedding:
            self.embedding.weight.requires_grad = False
    
    def forward(self, context, candidate, keywords=None):
        """
            context: (batch_size, context_len, seq_len)
            candidate: (batch_size, num_candidates, seq_len)
            keywords: (batch_size, 3)
        """
        # print(context.shape, candidate.shape, keywords.shape)
        batch_size, context_len, seq_len = context.shape
        _, num_candidates, _ = candidate.shape
        context_seq_lengths = context.reshape(batch_size*context_len, -1).ne(0).long().sum(dim=-1) # (batch_size*context_len, )
        context_lengths = context_seq_lengths.reshape(batch_size, context_len).ne(0).long().sum(dim=-1) # (batch_size, )
        candidate_seq_lengths = candidate.reshape(batch_size*num_candidates, -1).ne(0).long().sum(dim=-1) # (batch_size*num_candidates, )
        
        # context encoding
        context_out = self.embedding(context) # (batch, context_len, seq_len, embed_size)
        context_out, _ = self.utterance_encoder(context_out.reshape(batch_size*context_len, seq_len, -1)) # (batch*context_len, seq_len, 2*hidden_size)
        context_out = context_out[torch.arange(batch_size*context_len), (context_seq_lengths-1).clamp(min=0)] # (batch*context_len, 2*hidden_size)

        context_out, _ = self.context_encoder(context_out.reshape(batch_size, context_len, -1)) # (batch, context_len, 2*hidden_size)
        context_out = context_out[torch.arange(batch_size), (context_lengths-1).clamp(min=0)] # (batch, 2*hidden_size)

        # keyword encoding
        if keywords is not None:
            kw_out = self.embedding(keywords) # (batch, 3, embed_size)
            kw_out = self.kw_mlp(kw_out.sum(dim=1)) # (batch, 2*hidden_size); note: sums (not averages) the 3 keyword embeddings
            context_out = torch.cat([context_out, kw_out], dim=-1) # (batch, 4*hidden_size)

        # candidate encoding
        candidate_emb = self.embedding(candidate) # (batch, num_candidates, seq_len, embed_size)
        candidate_out, _ = self.candidate_encoder(candidate_emb.reshape(batch_size*num_candidates, seq_len, -1)) # (batch*num_candidates, seq_len, 2*hidden_size)
        candidate_out = candidate_out[torch.arange(batch_size*num_candidates), (candidate_seq_lengths-1).clamp(min=0)] # (batch*num_candidates, 2*hidden_size)

        # candidate encoding to compare with keywords
        if keywords is not None:
            candidate_out_kw, _ = self.candidate_kw_encoder(candidate_emb.reshape(batch_size*num_candidates, seq_len, -1)) # (batch*num_candidates, seq_len, 2*hidden_size)
            candidate_out_kw = candidate_out_kw[torch.arange(batch_size*num_candidates), (candidate_seq_lengths-1).clamp(min=0)] # (batch*num_candidates, 2*hidden_size)
            candidate_out = torch.cat([candidate_out, candidate_out_kw], dim=-1) # (batch*num_candidates, 4*hidden_size)
            out = self.match_MLP_kw((context_out.unsqueeze(1) * candidate_out.reshape(batch_size, num_candidates, -1))).squeeze(-1) # (batch, num_candidates)
        else:
            out = self.match_MLP((context_out.unsqueeze(1) * candidate_out.reshape(batch_size, num_candidates, -1))).squeeze(-1) # (batch, num_candidates)
        return out

zhongpeixiang commented on May 20 '20

It seems that the word embedding of the trained keyword predictor is reused in the retrieval model. I didn't implement that. I will fix this and let you know if it works.
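Roughly what I plan to do (a sketch with hypothetical names; it assumes both models share the same word vocab and expose an embedding module, as the SMN class above does):

def reuse_keyword_predictor_embedding(retrieval_model, kw_predictor, freeze=True):
    # Copy the word embedding learned by the keyword predictor into the
    # retrieval model, and optionally freeze it during retrieval training.
    retrieval_model.embedding.weight.data.copy_(kw_predictor.embedding.weight.data)
    if freeze:
        retrieval_model.embedding.weight.requires_grad = False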

zhongpeixiang commented on May 20 '20

After reusing the word embedding from the trained keyword predictor, the retrieval-neural model achieves a test R@1 of 0.5235, which is still a bit below the reported 0.5395. Hmm...

zhongpeixiang commented on May 20 '20

I suspect one reason is that we use different pretrained word embeddings. How was your pretrained word embedding obtained? GloVe trained on PersonaChat, or one of the files from https://nlp.stanford.edu/projects/glove/ ?

zhongpeixiang commented on May 22 '20

The embedding file is provided in the source data. It was obtained from https://nlp.stanford.edu/projects/glove/ (it seems to be glove.twitter.27B.zip).
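In case it helps, a rough sketch of loading such a file into an embedding matrix (the file name, dimension, and word2id mapping below are placeholders, not the exact code in this repository):

import numpy as np

def load_glove(path, word2id, dim):
    # Rows default to small random vectors for words not found in the GloVe file.
    emb = np.random.uniform(-0.1, 0.1, (len(word2id), dim)).astype(np.float32)
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word, values = parts[0], parts[1:]
            if word in word2id and len(values) == dim:
                emb[word2id[word]] = np.asarray(values, dtype=np.float32)
    return emb

# e.g. emb = load_glove('glove.twitter.27B.200d.txt', word2id, 200)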

Sorry, I am busy with some deadlines and have no time to check your code. If you still have any questions about this repository, feel free to ask me.

squareRoot3 commented on May 23 '20

Any advice on why my model didn't improve after incorporating keywords?

zhongpeixiang commented on Jun 03 '20

Any updates?

zhongpeixiang commented on Jun 10 '20