Incremental retrieval model training with Hashing method

Open nicewenhui opened this issue 1 year ago • 2 comments

I have developed a retrieval model for personalized movie recommendations. However, in the real world, new users and new content continue to emerge. To address this challenge, I have learned about the benefits of using hashing embedding.

In the tutorial, I saw that the hashing layer is placed inside the model architecture. Why does this avoid retraining the model every time? I also don't know how to handle hash collisions or how to choose an appropriate value for the num_bins parameter. In the provided example, even with only 5 inputs and num_bins set to 6, two values (['b'] and ['c']) were still hashed to the same bin.

import tensorflow as tf

layer = tf.keras.layers.Hashing(num_bins=6)
inp = [['a'], ['b'], ['c'], ['d'], ['e']]
layer(inp)

<tf.Tensor: shape=(5, 1), dtype=int64, numpy=
array([[3],
       [4],
       [4],
       [5],
       [1]])>

In my real code, for example, I already have 10,000 user_ids, and around 1,000 new users arrive each day. How should I set num_bins to ensure each user gets a unique hashed code? What about counting the total number of users each day and setting num_bins to that day's user count? Would existing users still get the same hashed codes as before?

Thanks in advance.

nicewenhui • Nov 13 '23 09:11

I think hashing has two main benefits. First, new items get mapped to different hash buckets, so not all new items are treated the same. Second, collisions can have a regularizing effect, as long as there are not too many of them: rare items usually have sparse data, so dedicating a separate embedding to each of them may overfit.

OmarMAmin • Jan 29 '24 18:01
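
To put rough numbers on the collision and num_bins questions in this thread, here is a minimal sketch in plain Python. It assumes the hash spreads ids uniformly over the bins, and it uses md5 from hashlib as a stand-in for a deterministic hash; that is not the hash function the Keras Hashing layer uses, and the helper names (stable_bin, prob_no_collision, expected_occupied_bins) are made up for this illustration.

```python
import hashlib


def stable_bin(key: str, num_bins: int) -> int:
    """Deterministically map a string id to a bin in [0, num_bins).

    md5 is only a stand-in here; the Keras Hashing layer uses a different hash.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_bins


def prob_no_collision(n_ids: int, num_bins: int) -> float:
    """Probability that n_ids uniformly hashed ids all land in distinct bins."""
    p = 1.0
    for i in range(n_ids):
        p *= max(0.0, 1.0 - i / num_bins)
    return p


def expected_occupied_bins(n_ids: int, num_bins: int) -> float:
    """Expected number of distinct bins hit by n_ids uniformly hashed ids."""
    return num_bins * (1.0 - (1.0 - 1.0 / num_bins) ** n_ids)


# 5 ids into 6 bins: a collision is the likely outcome, not bad luck.
print(f"P(no collision), 5 ids, 6 bins: {prob_no_collision(5, 6):.3f}")

# 10,000 ids into 10,000 bins: only about 63% of the bins end up used.
print(f"Expected occupied bins: {expected_occupied_bins(10_000, 10_000):.0f}")

old_users = [f"user_{i}" for i in range(100)]

# Fixed num_bins: an old id's bin depends only on the id and num_bins,
# so it stays the same no matter which new ids arrive later.
bins_day1 = [stable_bin(u, 100_000) for u in old_users]
bins_day2 = [stable_bin(u, 100_000) for u in old_users]
unchanged = bins_day1 == bins_day2

# Changing num_bins (e.g. resizing it every day) reassigns old ids.
bins_resized = [stable_bin(u, 101_000) for u in old_users]
moved = sum(a != b for a, b in zip(bins_day1, bins_resized))
print(f"Unchanged with fixed num_bins: {unchanged}; moved after resize: {moved}/100")
```

Under these assumptions, setting num_bins equal to the number of users does not give each user a unique code, and changing num_bins from day to day moves existing users to different bins, which would disconnect them from the embedding rows they were trained on. Picking one num_bins well above the expected number of ids and keeping it fixed is one way to limit collisions while keeping old assignments stable.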

For the user_ids, you can represent a user by the item_ids the user has consumed, which avoids retraining the model for each new user_id. If the catalogue is more stable than the user base, you'll get a more stable model by representing the user through selected features of their previous behavior.

OmarMAmin • Jan 29 '24 18:01
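
One possible reading of the suggestion above, sketched in plain Python: build the user representation by averaging the embeddings of the items in the user's history, so a brand-new user_id needs no embedding of its own. Mean pooling is just one choice among several, and the item ids, the 2-dimensional vectors, and the user_vector helper are all made up for illustration; in a real model the item embeddings would come from the trained item tower.

```python
from typing import Dict, List, Sequence


def user_vector(
    history: Sequence[str],
    item_embeddings: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Represent a user as the mean of the embeddings of consumed items.

    Items missing from item_embeddings are skipped; a user with no known
    items gets a zero vector of length dim.
    """
    vectors = [item_embeddings[i] for i in history if i in item_embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy item embeddings (hypothetical values, not learned from any data).
item_embeddings = {
    "movie_1": [1.0, 0.0],
    "movie_2": [0.0, 1.0],
    "movie_3": [1.0, 1.0],
}

# A user who first appeared today still gets a usable representation,
# because it depends only on the items they interacted with.
new_user_history = ["movie_1", "movie_2", "movie_unknown"]
print(user_vector(new_user_history, item_embeddings, dim=2))
```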