
Question about ReLU in Multi-Head Attention

Open ArtanisCV opened this issue 2 years ago • 0 comments

In multi-head attention, a ReLU activation is applied to the linear projections of the queries, keys, and values. Is this a correct implementation? The paper does not mention a ReLU in Eq. 5. Besides, it seems that the ReLU will make every entry of the attention score matrix QKᵀ non-negative.

```python
# Linear projections
Q = tf.layers.dense(queries, num_units, activation=tf.nn.relu)
K = tf.layers.dense(keys, num_units, activation=tf.nn.relu)
V = tf.layers.dense(values, num_units, activation=tf.nn.relu)
```
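For comparison, a purely linear projection with no activation (which appears to be what Eq. 5 describes) would look like the sketch below. Variable names (`queries`, `keys`, `values`, `num_units`) are taken from the snippet above; this is only an illustration of the alternative being asked about, not a claim about the authors' intent.

```python
import tensorflow as tf  # TF 1.x API, matching the snippet above

# Sketch: projections as written in Eq. 5, i.e. purely linear
# (activation=None). Whether this is the intended behaviour is
# exactly the question raised here.
Q = tf.layers.dense(queries, num_units, activation=None)
K = tf.layers.dense(keys, num_units, activation=None)
V = tf.layers.dense(values, num_units, activation=None)

# Note: in the ReLU version, every entry of Q and K is >= 0, so each
# attention logit in tf.matmul(Q, K, transpose_b=True) is a sum of
# products of non-negative numbers and is therefore >= 0, which is the
# behaviour described in the question.
```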

ArtanisCV · Aug 10 '22