DKN
There are no parameters for the MLP part in the paper.
Dear authors,
I found that the code implementation differs slightly from the structure presented in the paper (Figure 3). In the paper, both the attention weights and the final output are produced by a two-layer MLP. However, in the code dkn.py, they are implemented as:
```python
attention_weights = tf.reduce_sum(clicked_embeddings * news_embeddings_expanded, axis=-1)
```
and
```python
self.scores_unnormalized = tf.reduce_sum(user_embeddings * news_embeddings, axis=1)
```
These are just inner products between two vectors; there are no learnable parameters applied to the concatenation. My questions are:
- Why does this work? Can the model be trained well by only updating the embeddings, without any MLP weights?
- Why is the code implementation different from the paper?
- The inner product works according to the experiments; it is hard to say which one is always better.
- The code here has been refactored and simplified to reduce computation overhead.
However, if the attention network is just a simple "cosine similarity" between the user's clicked news and the candidate news, it cannot be called an attention "network", because it has no parameters and cannot learn anything. If an inner product can simply replace the attention network, it means that the attention network is meaningless.
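To make the difference concrete, here is a minimal NumPy sketch contrasting the two variants being discussed: the inner-product attention used in dkn.py versus the two-layer MLP attention described in the paper. The shapes, weight names (`W1`, `b1`, `W2`, `b2`), and random data are illustrative assumptions, not taken from the actual repo.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16          # embedding dimension (illustrative)
n_clicked = 5   # number of clicked news per user (illustrative)

clicked = rng.normal(size=(n_clicked, d))   # clicked-news embeddings
candidate = rng.normal(size=(d,))           # candidate-news embedding

# Code version (dkn.py): the attention weight is a plain inner product,
# so there are no attention-specific parameters to learn.
weights_dot = clicked @ candidate           # shape (n_clicked,)

# Paper version (Figure 3): a two-layer MLP scores the concatenation of
# each clicked embedding with the candidate embedding. W1/b1/W2/b2 are
# the extra learnable parameters the issue is asking about.
hidden = 8
W1 = rng.normal(size=(2 * d, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(size=(hidden, 1))
b2 = np.zeros(1)

concat = np.concatenate([clicked, np.tile(candidate, (n_clicked, 1))], axis=1)
weights_mlp = (np.maximum(concat @ W1 + b1, 0) @ W2 + b2).squeeze(-1)

# Either score vector is then softmax-normalized into attention weights
# and used to form the user embedding as a weighted sum of clicked news.
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

attn_dot = softmax(weights_dot)
attn_mlp = softmax(weights_mlp)
user_embedding = attn_dot @ clicked         # shape (d,)
```

In the inner-product variant, gradients still flow into the embeddings themselves (and the CNN layers that produce them), which is why it can still train; the only thing lost is the attention module's own capacity.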