tf_chatbot_seq2seq_antilm
Is the beam search decoding correct?
Hello, I am just curious how you implemented beam search decoding. Did you use the output logits of the default Seq2Seq model? I am working through the dialogue-generation papers and trying to implement beam search on top of TensorFlow's Seq2Seq model. The problem is in the file "seq2seq.py", TensorFlow's implementation of the sequence-to-sequence model. In that file, the loop_function finds the best word from the previous step and feeds it into the next step:
```python
def _extract_argmax_and_embed(embedding,
                              output_projection=None,
                              update_embedding=True):
  """Get a loop_function that extracts the previous symbol and embeds it.

  Args:
    embedding: embedding tensor for symbols.
    output_projection: None or a pair (W, B). If provided, each fed previous
      output will first be multiplied by W and added B.
    update_embedding: Boolean; if False, the gradients will not propagate
      through the embeddings.

  Returns:
    A loop function.
  """
  def loop_function(prev, _):
    if output_projection is not None:
      prev = nn_ops.xw_plus_b(prev, output_projection[0], output_projection[1])
    prev_symbol = math_ops.argmax(prev, 1)
    # Note that gradients will not propagate through the second parameter of
    # embedding_lookup.
    emb_prev = embedding_ops.embedding_lookup(embedding, prev_symbol)
    if not update_embedding:
      emb_prev = array_ops.stop_gradient(emb_prev)
    return emb_prev
  return loop_function
```
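To see the greedy feedback pattern this loop_function implements, here is a minimal pure-Python sketch: at every step, only the single argmax symbol from step t-1 is fed into step t. The `toy_logits` model is a made-up stand-in, just to make the loop runnable.

```python
def toy_logits(prev_symbol, step):
    """Hypothetical stand-in for one decoder step: a score per vocabulary id."""
    vocab = 4
    # Deterministic fake scores so the example is reproducible.
    return [((prev_symbol + 1) * (i + 1) * (step + 2)) % 7 for i in range(vocab)]

def greedy_decode(start_symbol, num_steps):
    """Feed the argmax of step t-1 into step t, as loop_function does."""
    symbols = []
    prev = start_symbol
    for t in range(num_steps):
        logits = toy_logits(prev, t)
        prev = max(range(len(logits)), key=lambda i: logits[i])  # argmax
        symbols.append(prev)
    return symbols
```

The point is that only one symbol survives each step, so the logits at step t only ever describe continuations of that one greedy prefix.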
The issue with using the output logits of the Seq2Seq model to implement beam search is that if you select the top B words at step t-1 (B is the beam size), any logits you read at step t are conditioned on the single best word from step t-1. For example, say at step 1 of decoding you have selected the top B words (W1, W2, ..., WB), and the best word (with the highest logit) is W_STAR. At step 2, we select another B words (M1, M2, ..., MB) and now want to pick the top B of the B*B combinations. The problem is that the output logit for a word Mi, which should give P(Mi), is actually P(Mi | W_STAR). So how can we compute the score for the sequence W1 Mi? Its probability would come out as P(W1) * P(Mi | W_STAR), which is not P(W1) * P(Mi | W1).
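A beam search that avoids this mismatch has to carry each hypothesis's own prefix and condition every extension on it. Here is a minimal sketch in plain Python; `toy_log_prob` is a hypothetical stand-in for a real decoder step, and its interface (prefix in, log-probability out) is the assumption:

```python
import math

VOCAB = 4  # toy vocabulary size (assumption for the sketch)

def toy_log_prob(prefix, word):
    """Hypothetical stand-in for one decoder step: log P(word | prefix)."""
    last = prefix[-1] + 1 if prefix else 1
    scores = [math.exp(((len(prefix) + 1) * (w + 1) * last) % 5)
              for w in range(VOCAB)]
    return math.log(scores[word] / sum(scores))

def beam_search(start, steps, beam=2):
    # Each hypothesis carries its own prefix, so every step's scores are
    # conditioned on that hypothesis -- not on the single argmax word.
    hyps = [([start], 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = [(prefix + [w], score + toy_log_prob(prefix, w))
                      for prefix, score in hyps
                      for w in range(VOCAB)]
        # Keep the top-B extensions ranked by total sequence score.
        hyps = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
    return hyps
```

Note that with a real Seq2Seq model this requires running B decoder steps (one per surviving hypothesis) at every time step, which is exactly what the stock loop_function cannot do.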
I think we have to change the architecture of the seq2seq model to enable beam search decoding, as discussed here: https://github.com/tensorflow/tensorflow/issues/654
I would like to know if the beam search implementation here is correct or not. Thanks a lot.
@SaeedNajafi Well, the conditional probability of generating the next word is conditioned on the whole generated prefix, not only on the previous word. So the way to find the combination with the highest score is to multiply the probabilities of each generated word along every hypothesis and take the highest product over them all.
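To make that concrete, here is a tiny numeric example (the probabilities are made up): scoring whole hypotheses by the product of their per-step probabilities, equivalently the sum of log-probabilities, can prefer a sequence whose first word was not the locally best one.

```python
import math

# Made-up per-step conditional probabilities along two hypotheses:
seq_a = [0.5, 0.1]   # P(A1) * P(A2 | A1) -- starts with the locally best word
seq_b = [0.3, 0.4]   # P(B1) * P(B2 | B1)

score_a = sum(math.log(p) for p in seq_a)  # log(0.05)
score_b = sum(math.log(p) for p in seq_b)  # log(0.12)

best = "B" if score_b > score_a else "A"   # hypothesis B wins overall
```

Greedy decoding would commit to A1 (probability 0.5) and never recover, while whole-sequence scoring correctly prefers B.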