recommenders
Seeking Guidance on Feature Preparation for Retrieval and Ranking Models
Hello,
I am currently working on developing a recommender system for internal customers and have found this framework to be a great tool to start with. As a beginner in the field of recommender systems, I am hoping to learn from the experts here.
The current problem I am facing involves a relatively small dataset of around 100 users and 200k items. However, this item dataset is expected to increase to 4 million in the near future. I have binary interaction data as well as negative labels (items presented to users that they did not act on), although the size of the negative labels is small.
Based on my research, a two-stage or multi-stage recommender system seems to be a better approach compared to a single-stage recommender. Additionally, from the other discussions here, it seems that it is a default practice to train the retrieval model with positive data points and the ranking model with both positive and negative data points.
The remaining doubts I have are:
- How should I prepare features for the retrieval model and the ranking model, respectively? Should they be the same or different?
- Is sparsity in the interaction matrix a significant issue? In this problem setup, every item has been interacted with at least once, but typically by only a few users at most, due to the ownership mechanism in place (a user has to give up an item before other users can own it).
Any guidance or suggestions would be greatly appreciated. Thank you!
- Typically, most features can be shared between the retrieval and ranking models. But in most cases the ranking model will have more complex features, because it only has to score a limited number of candidates.
- Regarding sparsity, I think the critical issue in your system is that you have only 100 users. May I ask how many data samples you have? If the dataset is small, it's better not to use a DNN; traditional approaches like CF or FM can work quite well. For such a small user set, you can actually use a brute-force method to generate the best results for each user offline. There is no need for a two-stage model. For example, build only a ranking model, use it to rank all 200k items for each user, and dump the results to the online service.
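To make the brute-force suggestion concrete, here is a minimal numpy sketch of offline scoring: rank every item for every user and dump the top-k per user. All shapes and embeddings are made up for illustration (the real setting would be 100 users by 200k items, which is still only a 100 x 200k score matrix):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes, kept small so the sketch runs quickly.
n_users, n_items, dim = 100, 2000, 32
user_emb = rng.normal(size=(n_users, dim)).astype(np.float32)
item_emb = rng.normal(size=(n_items, dim)).astype(np.float32)

def rank_all_items(user_emb, item_emb, k=10):
    """Brute-force: score every (user, item) pair, keep the top-k items per user."""
    scores = user_emb @ item_emb.T                 # (n_users, n_items)
    top_k = np.argsort(-scores, axis=1)[:, :k]     # best-scoring items first
    return {u: top_k[u].tolist() for u in range(len(user_emb))}

# The resulting dict can be dumped to the online service as-is.
recs = rank_all_items(user_emb, item_emb, k=10)
```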
Thank you for your advice! The positive interaction dataset contains approximately 100k to 200k data points, and the negative interaction dataset is about the same size.
I am interested in using a framework like this, which can incorporate user and item features, to develop a content-based filtering system, as recommending items that other users like does not meet our business requirements.
I do like the idea of a simplified, single-stage system. I am not sure whether this approach would encounter any difficulties when scaling up to handle 4 million items. Most of these items are fresh and have not been interacted with, and the majority of them are irrelevant. My goal is to recommend relevant items from this set to users, and there is also a need to generate item-to-item recommendations by identifying the top similar items from the 4 million set. I'm currently doing offline scoring using the brute-force method for the 200k dataset, and I plan to transition to using ANN when the dataset scales up to 4 million items.
> contains approximately 100k to 200k data points,
Is this the total dataset or just the samples from one day? If it's a daily number, a simple approach is to expand the time window and include more samples from the past. If it's the total number, then it's pretty tiny. You can try simple models with simple features; otherwise it can easily overfit.
> as recommending items that other users like does not meet our business requirements.
This is a little confusing to me. Normally it is the job of a recommender system to recommend potentially relevant items based on other users' behavior.
> Most of these items are fresh and have not been interacted with, and the majority of them are irrelevant
This brings up the issue of item cold start. I'm not sure your model can handle it well; a common approach is to leverage content-based recommendations to mitigate it. I'm also somewhat concerned: with only 100 users, how could those users possibly view 4m items? If the majority are irrelevant, pre-filtering before ranking is a good idea. Simply turning the CG into a content-based filter can work.
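A minimal sketch of what such a content-based pre-filter could look like, assuming precomputed content embeddings for the catalog and a user profile vector (all names and shapes here are illustrative stand-ins, not real data):

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 16
catalog = rng.normal(size=(10_000, dim))   # stand-in for the mostly-fresh 4m-item catalog
user_profile = rng.normal(size=(dim,))     # e.g. mean embedding of a user's owned items

def prefilter(catalog, user_profile, keep=500):
    """Content-based candidate generation: keep the `keep` items whose
    content embeddings are most cosine-similar to the user profile."""
    cat = catalog / np.linalg.norm(catalog, axis=1, keepdims=True)
    prof = user_profile / np.linalg.norm(user_profile)
    sims = cat @ prof                      # cosine similarity per item
    return np.argsort(-sims)[:keep]        # indices of the surviving candidates

candidates = prefilter(catalog, user_profile, keep=500)
```

Only these surviving candidates would then be passed to the (more expensive) ranking step.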
> I'm currently doing offline scoring using the brute-force method for the 200k dataset, and I plan to transition to using ANN when the dataset scales up to 4 million items.
That depends on how fast your code runs on the 200k dataset. ANN can easily handle 4m items, and the search can also be done offline.
Whether to build an online ranking system mainly depends on how fast you want new items to be delivered to users.
But first of all, it seems like you already have embeddings? May I ask how you got them?
> Is this the total dataset or just the samples from one day? If it's a daily number, a simple approach is to expand the time window and include more samples from the past. If it's the total number, then it's pretty tiny. You can try simple models with simple features; otherwise it can easily overfit.
This is the total dataset. It's indeed tiny, but I'm trying to see what I can get out of it. Do you think this is too small for the two-tower model, even if it's not very deep?
> This is a little confusing to me. Normally it is the job of a recommender system to recommend potentially relevant items based on other users' behavior.
The business problem I'm facing is a bit unique: each item can only be owned by one user at a time, and an item can only pass to another user if the current owner decides to give it up.
> This brings up the issue of item cold start. I'm not sure your model can handle it well; a common approach is to leverage content-based recommendations to mitigate it. I'm also somewhat concerned: with only 100 users, how could those users possibly view 4m items? If the majority are irrelevant, pre-filtering before ranking is a good idea. Simply turning the CG into a content-based filter can work.
Good point; that's why I'm planning to build a content-based recommender using TensorFlow Recommenders. Pre-filtering definitely seems like a good idea. By the way, what does CG refer to in this context?
> That depends on how fast your code runs on the 200k dataset. ANN can easily handle 4m items, and the search can also be done offline. Whether to build an online ranking system mainly depends on how fast you want new items to be delivered to users. But first of all, it seems like you already have embeddings? May I ask how you got them?
For now, item embeddings are obtained by feeding their descriptions through a pre-trained NLP model, while user embeddings are obtained as the average of the corresponding item embeddings. However, the ultimate goal is to improve the embeddings by incorporating other features such as text, categorical data, and numerical data, and passing them through the two-tower architecture.
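As a toy illustration of that averaging step (the item IDs and 3-dimensional embeddings below are made up; in practice the vectors would come from a pre-trained NLP model over item descriptions):

```python
import numpy as np

# Item embeddings, e.g. from a pre-trained sentence encoder over item descriptions.
item_emb = {
    "item_a": np.array([1.0, 0.0, 0.0]),
    "item_b": np.array([0.0, 1.0, 0.0]),
    "item_c": np.array([0.0, 0.0, 1.0]),
}

def user_embedding(interacted_items, item_emb):
    """Simple user profile: the mean of the embeddings of items the user
    has interacted with."""
    return np.mean([item_emb[i] for i in interacted_items], axis=0)

u = user_embedding(["item_a", "item_b"], item_emb)  # -> [0.5, 0.5, 0.0]
```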
> Do you think this is too small for the two-tower model, even if it's not very deep?
Aha, I don't have experience with such a small dataset. We can get some insight from the TensorFlow Recommenders example: they use only a single ID embedding feature, train on a 100k dataset, and still achieve good performance. You can follow that example and test the performance; the development cost is manageable.
> The business problem I'm facing is a bit unique: each item can only be owned by one user at a time, and an item can only pass to another user if the current owner decides to give it up.
I see. Then the key point is whether the users have similar behavior patterns. If so, the model should be able to learn those patterns from the data.
> By the way, what does CG refer to in this context?
It's short for candidate generator (generation). I finally understand your idea: you are trying to build a two-tower model that mainly leverages content features, right? That is also a kind of content-based recommendation.
> while user embeddings are obtained as the average of the corresponding item embeddings.
Instead of averaging, you can also try sequential models like an RNN or a transformer. NLP embeddings do work well in my experience. In general, you can start from a simple two-tower model leveraging TFRS and treat it as a baseline ranking model; a two-tower model can be considered a ranking model in a broad sense. If it doesn't work, go back to traditional models like CF, FM, LR, or GBDT. If it works, try adding a more complex ranking model after the two-tower stage.
> In general, you can start from a simple two-tower model leveraging TFRS and treat it as a baseline ranking model; a two-tower model can be considered a ranking model in a broad sense. If it doesn't work, go back to traditional models like CF, FM, LR, or GBDT. If it works, try adding a more complex ranking model after the two-tower stage.
Sounds like a plan! Thanks for the tips!
If I may, I have one more question. Is the two-tower model a suitable approach for item-to-item recommendations, like recommending similar items based on a queried item? It appears straightforward to use item embeddings coming out of the item tower to compute similarity scores. However, I'm uncertain about the performance of training a two-tower model using user-item interaction data to achieve this goal versus training a model based solely on item similarity data (no user tower in this case).
> Is the two-tower model a suitable approach for item-to-item recommendations, like recommending similar items based on a queried item?
Yes, it is. A good practice is to use NLP embeddings such as SentenceBert or SimCSE as input features, then train on user-action data; this way the model learns both semantics and user behavior. If your business requirements favor semantic similarity, you can even try using only the NLP embeddings.
> versus training a model based solely on item similarity data
For pure item similarity, NLP embeddings already work great; I don't think you need to train another model for this. Ultimately it depends on your business requirements and user experience: if you use NLP embeddings, the titles of similar items will be very close to each other.
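A small sketch of item-to-item lookup over precomputed (e.g. NLP-derived) item embeddings, using brute-force cosine similarity; the embeddings here are random stand-ins, and at 4m items this loop is exactly what an ANN index would replace:

```python
import numpy as np

rng = np.random.default_rng(2)
item_emb = rng.normal(size=(1000, 24))                         # stand-in item embeddings
item_emb /= np.linalg.norm(item_emb, axis=1, keepdims=True)    # unit-normalize once up front

def similar_items(query_idx, item_emb, k=5):
    """Top-k most cosine-similar items to the queried item, excluding itself."""
    sims = item_emb @ item_emb[query_idx]      # cosine similarity, since rows are unit vectors
    order = np.argsort(-sims)
    return [int(i) for i in order if i != query_idx][:k]

neighbors = similar_items(42, item_emb, k=5)
```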
> For pure item similarity, NLP embeddings already work great.
Indeed, it works reasonably well as a starting point. I'm looking to incorporate categorical features as well. However, I'm concerned about the differences in dimension and distribution between the text embeddings and the encoded categorical features. Do you think cosine similarity will still be effective if we concatenate text embeddings and encoded categorical features? I'm considering using a neural network to condense the concatenated features down to a dense vector if the cosine similarity is impacted.
> Do you think cosine similarity will still be effective if we concatenate text embeddings and encoded categorical features?
Yes, cosine similarity is exactly what the two-tower model learns. Concatenation followed by FC layers is a common approach to extracting deep features; you can refer to the two-tower or YouTubeDNN papers, which both use this idea. I wrote a few blog posts about this; maybe you'll find them interesting.
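As a rough illustration of the concatenate-then-project idea: the weight matrices below are random stand-ins for what a tower's embedding table and FC layer would actually learn during training, and the shapes are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(3)

text_emb = rng.normal(size=(768,))                  # e.g. a SentenceBert embedding
cat_table = rng.normal(size=(10, 16), scale=0.1)    # stand-in learned categorical embedding table
W = rng.normal(size=(768 + 16, 64), scale=0.05)     # stand-in learned FC projection

def tower(text_emb, category_id):
    """Concatenate text + categorical embeddings, project through an FC layer,
    then L2-normalize so that a dot product between towers is cosine similarity."""
    x = np.concatenate([text_emb, cat_table[category_id]])  # (784,)
    h = np.maximum(W.T @ x, 0.0)                            # ReLU(FC(x)), (64,)
    return h / np.linalg.norm(h)

v = tower(text_emb, category_id=2)
```

Because both inputs pass through the same learned projection, the mismatch in dimension and scale between the text embedding and the categorical encoding is absorbed by the FC layer rather than distorting the cosine similarity directly.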
> I wrote a few blog posts about this; maybe you'll find them interesting.
Thank you for sharing. Definitely very helpful!