Target-Guided-Conversation
Some questions regarding evaluations on next keyword prediction
Hi,
Thank you very much for sharing your work!
I have a few questions regarding the evaluation of keyword prediction. I'm sorry if I have missed or misunderstood something in your code, since I'm not familiar with TensorFlow.
- For a given history of keywords, there can be multiple target keywords for the next turn. Do you minimize the negative log-likelihood loss for every target keyword? Is the batch loss averaged over the batch size or over the number of target keywords in the batch?
- How did you compute the correlation metric? Greedy, average, or max embedding? Do you compute the correlation between only the top-1 keyword and the target keywords, or use the top-k keywords? Do you average across target keywords before or after computing correlations?
Any response will be appreciated.
Thanks, Peixiang
Thanks for your attention,
- In this repository, we treat next keyword prediction as a binary classification of each candidate keyword and minimize the cross-entropy loss over both positive and negative labels. The loss is averaged over all keywords.
kw_labels = tf.map_fn(lambda x: tf.sparse_to_dense(x, [self.kw_vocab.size], 1., 0., False),
                      keywords_ids, dtype=tf.float32, parallel_iterations=True)[:, 4:]  # multi-hot labels over the keyword vocab; [:, 4:] drops the leading special tokens
loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=kw_labels, logits=matching_score)  # per-keyword binary cross entropy
loss = tf.reduce_mean(loss)  # averaged over all candidate keywords in the batch
You can also minimize the negative log-likelihood loss of every target keyword after a softmax layer; in my experience, the training results are similar.
- The correlation metric is computed as the maximum cosine similarity over word-embedding pairs between the top-k predicted keywords and all words in the target response.
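A minimal sketch of that computation (the function name and inputs are hypothetical and only illustrate the definition above; the actual code in this repository may differ):

import torch
import torch.nn.functional as F

def keyword_correlation(topk_keyword_vecs, response_word_vecs):
    """
    topk_keyword_vecs: (k, embed_dim) embeddings of the top-k predicted keywords
    response_word_vecs: (n, embed_dim) embeddings of all words in the target response
    Returns the max cosine similarity over all (keyword, response word) pairs.
    """
    a = F.normalize(topk_keyword_vecs, dim=-1)   # unit-normalize keyword embeddings
    b = F.normalize(response_word_vecs, dim=-1)  # unit-normalize response word embeddings
    return (a @ b.t()).max().item()              # (k, n) cosine-similarity matrix -> max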
@squareRoot3 Thank you very much for the quick reply. I have two more questions regarding keyword prediction.
Q1
It seems that the test keywords are used as the vocab during training. Is there any reason for this?
./config/data_config.py:
_keywords_path = 'tx_data/test/keywords_vocab.txt'
./model/neural.py:
self.kw_vocab = tx.data.Vocab(self.data_config._keywords_path)
Q2
I experimented with both the binary CE loss over every candidate keyword and the negative log-likelihood loss over every target keyword. The former gives an R@1 of 0.015 while the latter gives an R@1 of 0.065. Why is the former loss not comparable with your results?
Here is the PyTorch code to compute the two losses:
import torch
import torch.nn.functional as F

def compute_BCE(logits, target):
    """
    logits: (batch, vocab_size)
    target: (batch, seq_len), we set seq_len=10 such that each utterance has a max of 10 target keywords, the rest are padded with 0
    """
    target_new = torch.zeros_like(logits)            # (batch, vocab_size)
    target_new = target_new.scatter(1, target, 1.0)  # multi-hot targets over the keyword vocab
    target_new[:, 0] = 0                             # assign pad token to 0
    loss = F.binary_cross_entropy_with_logits(logits, target_new)
    return loss

def compute_NLLLoss(logits, target):
    """
    logits: (batch, vocab_size)
    target: (batch, seq_len)
    """
    target_mask = target.ne(0).float()  # (batch, seq_len), mask out paddings
    logits = F.log_softmax(logits, dim=-1)
    loss = -1 * (torch.gather(logits, dim=1, index=target) * target_mask).sum()  # negative log-likelihood loss
    loss = loss / target_mask.sum()
    return loss
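For concreteness, both functions can be sanity-checked on dummy tensors like this (the sizes below are arbitrary; keyword id 0 is treated as padding):

logits = torch.randn(4, 2000)             # (batch, keyword_vocab_size), arbitrary sizes for illustration
target = torch.randint(0, 2000, (4, 10))  # up to 10 target keyword ids per example, 0-padded
print(compute_BCE(logits, target).item(), compute_NLLLoss(logits, target).item())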
Q1: I had thought that the test keyword vocab contains more frequent keywords and is relatively smaller, which can facilitate training. But using the train keyword vocab seems more reasonable. We have fixed this in the new repository: https://github.com/James-Yip/TGODC-DKRN.
Q2: The implementations of the two losses look correct, so I'm sorry, but I have no idea why. The BCE loss in our repository works normally.
Sorry to bother you again. Another strange thing happened with the retrieval-neural model.
I trained a keyword prediction model and obtained around 0.08 test R@1.
I also trained a retrieval baseline (without keyword conditioning) and obtained around 0.51 test R@1.
However, when I train the retrieval-neural model to use predicted keywords to retrieve the next turn, the result is still around 0.51. It seems that using keywords does not improve model performance.
My implementation of keyword conditioning follows your code:
- Predict the top 3 keywords for the next turn based on the keyword history and the pretrained keyword predictor.
- Average the 3 keyword embeddings.
- Apply a linear transformation to get K.
- Encode the contextual utterances to get C.
- Concatenate with the contextual utterance representation to get [C;K].
- Encode the candidate responses to get R.
- Use a separate GRU encoder to encode the candidates for comparison with the keywords, giving R_kw.
- Concatenate the two candidate representations to get [R;R_kw].
- Apply elementwise multiplication between [C;K] and [R;R_kw], followed by a linear transformation.
import torch
import torch.nn as nn

class SMN(nn.Module):
    def __init__(self, embed_size, vocab_size, hidden_size, n_layers, bidirectional, dropout=0):
        super(SMN, self).__init__()
        self.embed_size = embed_size
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        self.bidirectional = bidirectional
        self.dropout = dropout
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.utterance_encoder = nn.GRU(embed_size, hidden_size, n_layers, batch_first=True, dropout=dropout, bidirectional=True)
        self.context_encoder = nn.GRU(2*hidden_size, 2*hidden_size, n_layers, batch_first=True, dropout=dropout, bidirectional=False)
        self.candidate_encoder = nn.GRU(embed_size, hidden_size, n_layers, batch_first=True, dropout=dropout, bidirectional=True)
        self.candidate_kw_encoder = nn.GRU(embed_size, hidden_size, n_layers, batch_first=True, dropout=dropout, bidirectional=True)
        self.kw_mlp = nn.Linear(embed_size, 2*hidden_size)
        self.match_MLP_kw = nn.Linear(4*hidden_size, 1)
        self.match_MLP = nn.Linear(2*hidden_size, 1)

    def init_embedding(self, embedding, fix_word_embedding):
        self.embedding.weight.data.copy_(embedding)
        if fix_word_embedding:
            self.embedding.weight.requires_grad = False

    def forward(self, context, candidate, keywords=None):
        """
        context: (batch_size, context_len, seq_len)
        candidate: (batch_size, num_candidates, seq_len)
        keywords: (batch_size, 3)
        """
        # print(context.shape, candidate.shape, keywords.shape)
        batch_size, context_len, seq_len = context.shape
        _, num_candidates, _ = candidate.shape
        context_seq_lengths = context.reshape(batch_size*context_len, -1).ne(0).long().sum(dim=-1)  # (batch_size*context_len, )
        context_lengths = context_seq_lengths.reshape(batch_size, context_len).ne(0).long().sum(dim=-1)  # (batch_size, )
        candidate_seq_lengths = candidate.reshape(batch_size*num_candidates, -1).ne(0).long().sum(dim=-1)  # (batch_size*num_candidates, )
        # context encoding
        context_out = self.embedding(context)  # (batch, context_len, seq_len, embed_size)
        context_out, _ = self.utterance_encoder(context_out.reshape(batch_size*context_len, seq_len, -1))  # (batch*context_len, seq_len, 2*hidden_size)
        context_out = context_out[torch.arange(batch_size*context_len), (context_seq_lengths-1).clamp(min=0)]  # (batch*context_len, 2*hidden_size)
        context_out, _ = self.context_encoder(context_out.reshape(batch_size, context_len, -1))  # (batch, context_len, 2*hidden_size)
        context_out = context_out[torch.arange(batch_size), (context_lengths-1).clamp(min=0)]  # (batch, 2*hidden_size)
        # keyword encoding
        if keywords is not None:
            kw_out = self.embedding(keywords)  # (batch, 3, embed_size)
            kw_out = self.kw_mlp(kw_out.sum(dim=1))  # (batch, 2*hidden_size)
            context_out = torch.cat([context_out, kw_out], dim=-1)  # (batch, 4*hidden_size)
        # candidate encoding
        candidate_emb = self.embedding(candidate)  # (batch, num_candidates, seq_len, embed_size)
        candidate_out, _ = self.candidate_encoder(candidate_emb.reshape(batch_size*num_candidates, seq_len, -1))  # (batch*num_candidates, seq_len, 2*hidden_size)
        candidate_out = candidate_out[torch.arange(batch_size*num_candidates), (candidate_seq_lengths-1).clamp(min=0)]  # (batch*num_candidates, 2*hidden_size)
        # candidate encoding to compare with keywords
        if keywords is not None:
            candidate_out_kw, _ = self.candidate_kw_encoder(candidate_emb.reshape(batch_size*num_candidates, seq_len, -1))  # (batch*num_candidates, seq_len, 2*hidden_size)
            candidate_out_kw = candidate_out_kw[torch.arange(batch_size*num_candidates), (candidate_seq_lengths-1).clamp(min=0)]  # (batch*num_candidates, 2*hidden_size)
            candidate_out = torch.cat([candidate_out, candidate_out_kw], dim=-1)  # (batch*num_candidates, 4*hidden_size)
            out = self.match_MLP_kw((context_out.unsqueeze(1) * candidate_out.reshape(batch_size, num_candidates, -1))).squeeze(-1)  # (batch, num_candidates)
        else:
            out = self.match_MLP((context_out.unsqueeze(1) * candidate_out.reshape(batch_size, num_candidates, -1))).squeeze(-1)  # (batch, num_candidates)
        return out
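For reference, a quick smoke test of the forward pass with made-up sizes (token id 0 is the pad index):

model = SMN(embed_size=200, vocab_size=20000, hidden_size=300, n_layers=1, bidirectional=True)
context = torch.randint(1, 20000, (2, 5, 30))     # (batch=2, context_len=5, seq_len=30)
candidate = torch.randint(1, 20000, (2, 20, 30))  # (batch=2, num_candidates=20, seq_len=30)
keywords = torch.randint(1, 20000, (2, 3))        # top-3 predicted keyword ids per example
scores = model(context, candidate, keywords)      # (2, 20) matching scores
print(scores.shape)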
It seems that the word embedding of the trained keyword predictor is used in the retrieval model. I didn't implement that. I will fix this and let you know if it works.
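One way to do that with the code above (assuming a hypothetical trained kw_predictor whose word embedding layer is kw_predictor.embedding and that both models share the same word vocabulary) would be:

# hypothetical: copy the keyword predictor's trained embedding into the retrieval model
model.init_embedding(kw_predictor.embedding.weight.data, fix_word_embedding=False)  # pass True instead to freeze the reused embeddings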
After reusing the word embedding from the trained keyword predictor, the retrieval-neural model achieves a test R@1 of 0.5235, which is still a bit below the reported 0.5395. Hmm...
I suspect that one reason is that we use different pretrained word embeddings. How was your pretrained word embedding obtained? GloVe trained on PersonaChat, or one of the files here: https://nlp.stanford.edu/projects/glove/ ?
The embedding file is provided in the source data. It is obtained from https://nlp.stanford.edu/projects/glove/ (seems to be glove.twitter.27B.zip)
Sorry, I am busy with some deadlines and have no time to check your code. If you still have any questions about this repository, feel free to ask me.
Any advice on why my model did not improve after incorporating keywords?
Any updates?