Sequence-to-Sequence-and-Attention-from-scratch-using-Tensorflow
soft attention function
Thanks for the clean and easy-to-read code! I think there might be a bug in the soft attention module:
eij = tf.tanh(unrol_states)
# Softmax across the unrolling (time) dimension
softmax = tf.nn.softmax(eij, dim=1)
context = tf.reduce_sum(tf.multiply(softmax, unrol_states), axis=1)  # sum across the time axis
According to the cited attention paper, the eij in your code corresponds to the e_ij on page 3, and the softmax variable in the code should be the \alpha_{ij} from equation (6) on page 3 of the paper. So far so good. In the paper, however, the authors use the \alpha_{ij} to build the context vector as a weighted average of the raw encoder outputs h1, ..., hN, while you first transform these encoder outputs by
for h in range(num_unrollings):
    hidden_states[h] = tf.multiply(hidden_states[h], attn_weights) + prev_hidden_state_times_w
and only then apply the context averaging. Is there a special reason for doing it this way?
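For reference, this is how I read equation (6) and the context vector in the paper: the weights are a softmax over the scores, and the context is their weighted sum over the untransformed encoder outputs (writing N for the number of encoder steps):

```latex
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{N} \exp(e_{ik})}, \qquad
c_i = \sum_{j=1}^{N} \alpha_{ij} h_j
```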
Thanks a lot and cheers, g
The implementation is based on the following blog post: https://blog.heuritech.com/2016/01/20/attention-mechanism/ . The soft attention returns a weighted arithmetic mean of the y_i, where the weights are chosen according to the relevance of each y_i given the context c. These are the weights that are multiplied in the for loop.
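To make the weighted-mean view concrete, here is a minimal NumPy sketch of that idea. The `soft_attention` name and the dot-product relevance score are my own simplifications for illustration; the blog and this repo use a learned tanh scoring network instead:

```python
import numpy as np

def soft_attention(ys, c):
    """Weighted arithmetic mean of the y_i (rows of ys), with weights
    given by a softmax over a relevance score between each y_i and the
    context c. The dot-product score is an assumption for brevity; the
    blog/repo use a small learned network to produce the scores."""
    scores = ys @ c                       # relevance of each y_i to c, shape (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax -> alpha_i, sums to 1
    return weights @ ys                   # sum_i alpha_i * y_i

# Example: 4 encoder outputs of dimension 3
ys = np.random.randn(4, 3)
c = np.random.randn(3)
z = soft_attention(ys, c)                 # context vector, shape (3,)
```

With a zero context all scores are equal, so the result degenerates to the plain arithmetic mean of the y_i, which shows the "weighted mean" interpretation directly.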