
[Question] Difference between NCF and the two-tower model

Open jillwalker99 opened this issue 2 years ago • 18 comments

Hi all, as I delved further into the topic of recommender systems, the question came up of what the difference is between the two-tower model used here "https://www.tensorflow.org/recommenders/examples/basic_retrieval" and neural collaborative filtering (https://arxiv.org/abs/1708.05031).

Many thanks in advance.

jillwalker99 avatar Feb 05 '23 13:02 jillwalker99

Neural Collaborative Filtering is a class of embedding factorization models in which the similarity function between the user and item embeddings is learned (usually by an MLP) rather than being a fixed dot product. The model in the NCF paper you linked can be thought of as a two-tower model (not all NCF architectures can), but it would be considered a ranking model, not a retrieval model.

In the NCF case, if you have K candidate items, you need to run inference K times to get predictions for a single user. This is expensive and usually becomes infeasible with as few as 10K candidates.
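For intuition, here is a minimal sketch (illustrative names and sizes, not taken from the linked paper or tutorial) contrasting the two scoring functions: a fixed dot product versus an MLP-learned similarity as in NCF.

import tensorflow as tf

num_users, num_items, dim = 10_000, 4_000, 32  # hypothetical sizes
user_emb = tf.keras.layers.Embedding(num_users, dim)
item_emb = tf.keras.layers.Embedding(num_items, dim)

def dot_product_score(user_ids, item_ids):
    # Two-tower / matrix-factorization style: the similarity is a fixed dot
    # product, so item embeddings can be pre-computed once and searched with
    # (approximate) nearest-neighbour top-k at serving time.
    return tf.reduce_sum(user_emb(user_ids) * item_emb(item_ids), axis=-1)

mlp = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])

def ncf_score(user_ids, item_ids):
    # NCF style: the similarity function itself is learned by an MLP over the
    # concatenated embeddings, so scoring K candidates needs K forward passes.
    x = tf.concat([user_emb(user_ids), item_emb(item_ids)], axis=-1)
    return tf.squeeze(mlp(x), axis=-1)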

This paper, "Neural Collaborative Filtering vs. Matrix Factorization Revisited", suggests that the benefit of an MLP scoring function is marginal and requires careful tuning, whilst a dot-product interaction is a robust choice that offers efficient serving and scales to millions of candidate items.

That said, these architectures along with others might be useful in the subsequent ranking stage.

patrickorlando avatar Feb 06 '23 00:02 patrickorlando

Thank you for your feedback @patrickorlando. Would that mean that the linked model belongs to the class of Neural Collaborative Filtering models? And is the linked model a ranking model because its output is the dot product, i.e. the similarity score?

jillwalker99 avatar Feb 06 '23 21:02 jillwalker99

If the model uses a learnable layer to calculate the similarity/relevance score, then it may be considered an NCF model, but I wouldn't focus too much on the terminology. The key thing to remember is that if the similarity function is a dot product (first-order), it can be calculated efficiently at inference time using Approximate Nearest Neighbour search. It may, however, not be as powerful as a more expressive model that allows non-linear interactions between the user and item features (Deep & Cross Networks, DLRM, gradient-boosted tree ranking, ...). The modern approach is to break the problem up into stages (see the figure below).

[Figure: multi-stage recommendation funnel, candidate generation followed by ranking. Source: Deep Neural Networks for YouTube Recommendations]

This article from NVIDIA is also a helpful introduction to the concept.

The TFRS library is aligned with the concept of multi-stage recommender systems.
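To make the efficient-serving point concrete, here is a minimal sketch of what retrieval looks like once scoring is a dot product, along the lines of the basic_retrieval tutorial. It assumes a trained two-tower model with user_model/movie_model towers and a movies dataset of titles (both assumptions, not defined here); ScaNN offers the same interface for approximate search at larger scale.

import tensorflow as tf
import tensorflow_recommenders as tfrs

# Index the pre-computed candidate embeddings once.
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
index.index_from_dataset(
    tf.data.Dataset.zip((movies.batch(128), movies.batch(128).map(model.movie_model)))
)

# A single call returns the top-k candidates for a user, without running the
# model once per candidate.
scores, titles = index(tf.constant(["42"]), k=10)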

patrickorlando avatar Feb 06 '23 22:02 patrickorlando

@patrickorlando, thanks for the info. If the item space is around 4K, is it better to just do the ranking stage directly? Do you know of any papers discussing candidate-space size vs. ranking alone, or retrieval vs. ranking?

OmarMAmin avatar Feb 07 '23 11:02 OmarMAmin

@OmarMAmin I think this depends primarily on the serving time. If it remains acceptable for that number of candidates, a ranking model alone should be sufficient.

jillwalker99 avatar Feb 07 '23 21:02 jillwalker99

@patrickorlando thanks again :) Do you know of any other papers or write-ups on the two-tower model used in the TensorFlow guides, to understand the architecture and system in detail (apart from the YouTube recommender-system paper)?

jillwalker99 avatar Feb 07 '23 21:02 jillwalker99

@OmarMAmin, @jillwalker99 is correct, you may choose to implement a ranking-only model, provided serving time and cost are within budget. There is one other benefit to a dot-product scoring function: user and item embeddings end up in a shared vector space with a meaningful distance metric. Items and users that are similar will be close together under cosine/Euclidean distance, so you can cluster items or users, ensure that the items returned are not too similar, etc.
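As an illustration (hypothetical embeddings and threshold), a simple post-processing step that exploits this shared vector space, e.g. to keep a retrieved slate from containing near-duplicates:

import tensorflow as tf

# Hypothetical candidate-tower outputs: one 64-d embedding per item.
item_embeddings = tf.random.normal([4000, 64])

# Pairwise cosine similarities between items.
normed = tf.math.l2_normalize(item_embeddings, axis=-1)
cosine_sim = tf.matmul(normed, normed, transpose_b=True).numpy()

def diversify(ranked_ids, threshold=0.95):
    # Greedily keep items that are not too similar to any already-kept item.
    kept = []
    for idx in ranked_ids:
        if all(cosine_sim[idx, j] < threshold for j in kept):
            kept.append(idx)
    return kept

top_10_diverse = diversify(ranked_ids=list(range(50)))[:10]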

patrickorlando avatar Feb 07 '23 22:02 patrickorlando

Hi @patrickorlando, I have two questions again:

  1. Where is the dot product calculated within the two-tower model in TensorFlow (https://www.tensorflow.org/recommenders/examples/basic_retrieval)?
  2. Viewed as a machine-learning task, the two-tower model is a classification model. Is it a multi-class classification where every possible interaction represents a class, or a binary classification between positive interactions and all others? Or how should this be understood?

jillwalker99 avatar Feb 13 '23 20:02 jillwalker99

  1. It is calculated in the retrieval task: https://github.com/tensorflow/recommenders/blob/7caed557b9d5194202d8323f2d4795231a5d0b1d/tensorflow_recommenders/tasks/retrieval.py#L160-L161
  2. It is modelled as a massive multi-class classification problem: every candidate is a class. However, rather than computing the loss over all possible classes, only the candidates present in each batch are used as negatives. This is called a sampled softmax loss (a minimal sketch of the idea follows below). See the discussion in #334 for further details.
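For intuition, a minimal sketch of the in-batch sampled-softmax idea (not the exact tfrs.tasks.Retrieval implementation, which also handles metrics, temperature and optional corrections):

import tensorflow as tf

def in_batch_softmax_loss(query_embeddings, candidate_embeddings):
    # Score every query in the batch against every candidate in the same batch.
    scores = tf.matmul(query_embeddings, candidate_embeddings, transpose_b=True)
    # Row i's positive class is candidate i; the other candidates in the batch
    # act as the sampled negatives.
    labels = tf.eye(tf.shape(scores)[0], tf.shape(scores)[1])
    loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
    return loss_fn(labels, scores)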

patrickorlando avatar Feb 13 '23 23:02 patrickorlando

Thank you very much for your help @patrickorlando. So for each user it tries to predict the class (candidate/item), and the correct class is the item the user positively interacted with, right? One more question about development: what if the dataset comes from a popularity-based recommender (one that recommends the X most popular products by sales)? Then offline evaluation with, e.g., top-X accuracy is distorted and live A/B tests are necessary, or am I seeing something wrong here (since niche products are presumably sold even less, because the previous recommender never recommends them)?

jillwalker99 avatar Feb 14 '23 21:02 jillwalker99

Yes to both questions.

Evaluating recommender systems is hard. Online A/B tests are useful. A model that performs well on a dataset doesn't guarantee that you have a great recommender system. In general it's important to make sure that a model is not the only way users interact with items, otherwise you create a feedback loop and your overall system can get stuck in local minima. In your case, your model will be biased to popular items. You can down-weight or subsample these interactions, and you should think about how to include exploration into your system for future data collection.
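As a sketch of the down-weighting idea (one of several possible schemes; the helper below is hypothetical):

import collections

def popularity_weights(interaction_item_ids, alpha=0.5):
    # Weight each interaction by 1 / count(item)^alpha so that very popular
    # items contribute less to the loss. alpha=0 keeps the original
    # distribution, alpha=1 flattens it completely.
    counts = collections.Counter(interaction_item_ids)
    return [1.0 / (counts[item] ** alpha) for item in interaction_item_ids]

# Example: pass these as sample weights when computing the training loss.
weights = popularity_weights(["a", "a", "a", "b", "c"])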

patrickorlando avatar Feb 14 '23 22:02 patrickorlando

Hi @patrickorlando, is it correct to add a dropout layer to the query model and the candidate model, or what would be the right approach here?

self.candidate_model = tf.keras.Sequential([
    Item_Model(),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(64),
])

self.query_model = tf.keras.Sequential([
    User_Model(),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(64),
])

jillwalker99 avatar Feb 18 '23 22:02 jillwalker99

Hi @jillwalker99, sure, dropout can be added to your query and item towers, but you will probably need to tune this parameter. You might also want to add L2 normalization as the last layer of each tower and tune the temperature parameter of your retrieval task. See the discussion in #633.
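For context, a hedged sketch of where the temperature fits, assuming a TFRS version whose tfrs.tasks.Retrieval exposes a temperature argument (candidate_embeddings is a hypothetical dataset of candidate-tower outputs, and 0.05 is only a starting point to tune):

import tensorflow_recommenders as tfrs

task = tfrs.tasks.Retrieval(
    metrics=tfrs.metrics.FactorizedTopK(candidates=candidate_embeddings),
    # The dot-product logits are divided by the temperature before the softmax;
    # with L2-normalized towers this value usually needs to be well below 1.
    temperature=0.05,
)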

patrickorlando avatar Feb 19 '23 22:02 patrickorlando

Thank you @patrickorlando, do you mean "kernel_regularizer=tf.keras.regularizers.L2(0.001)", or how else can L2 normalization be implemented? And what exactly does L2 normalization do? I thought it allowed, among other things, computing cosine similarity instead of the dot product. Or what is the primary reason for using it?

jillwalker99 avatar Feb 20 '23 20:02 jillwalker99

L2 normalization scales a vector by its Euclidean length, which means the outputs of your query and candidate towers are constrained to the unit sphere. As the paper referenced in #633 states, this improves model training behaviour, but requires the softmax temperature (which then scales the dot-product scores) to be tuned carefully.

class L2Normalization(tf.keras.layers.Layer):
    """Scales inputs to unit Euclidean length along the given axis."""

    def __init__(self, axis=-1, **kwargs):
        super().__init__(**kwargs)
        self._axis = axis

    def call(self, inputs):
        # Divide each vector by its L2 norm so outputs lie on the unit sphere.
        return tf.linalg.l2_normalize(inputs, axis=self._axis)

    def get_config(self):
        return {"axis": self._axis}

or simply

l2_norm = tf.keras.layers.Lambda(lambda x:  tf.linalg.l2_normalize(x, axis=-1))
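As a usage example, reusing the hypothetical tower names from your earlier snippet, the normalization goes on as the final layer of each tower:

self.query_model = tf.keras.Sequential([
    User_Model(),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(64),
    L2Normalization(),  # outputs now lie on the unit sphere
])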

patrickorlando avatar Feb 20 '23 21:02 patrickorlando

Thanks again :) Would it make sense to also use a kernel_regularizer for the other hidden layers?

jillwalker99 avatar Feb 25 '23 20:02 jillwalker99

@jillwalker99, perhaps, but there is no one-size-fits-all approach.

patrickorlando avatar Feb 27 '23 23:02 patrickorlando