
Get similarity score/rank for a subset of N candidates

nialloh23 opened this issue · 5 comments

Context:
You have built your index using all of your candidate & query embeddings.
And now you want to get the similarity score between either:

  • A single user (1021314) & a subset list of N candidates ['TopGun', 'Star Wars', 'The Titanic']

This needs to be fast, so batched/vectorized predictions are required. Aside: this is a pattern I've seen a need for several times, so I'm flagging it here in case it's helpful.

Question: What is the best way to do this?

Options considered

  1. Exclude all candidates except the N=3 candidates you're concerned about:

for row in ratings.batch(300).take(1):
  top_movies = brute_force.query_with_exclusions(row, exclusions=row['exclusions'])[1].numpy()

  2. Manually extract the user & candidate embeddings from the model and handle this myself via simple dot products or cosine similarity between the embeddings [error prone]
  3. Get the scores/rank for all of my candidates and filter the results by index lookup after the fact [very slow]

Option 1 feels cleanest as it handles all vocabulary/index mappings and leaves less room for error.
The one downside is that when I have a large number of candidates (e.g. 500,000) it means I have to have a row in my dataset with (500,000 - N(3)) candidates to exclude for each row.

I guess I was wondering if it is possible to instead of specifying exclusions to actually specify a subset of candidates to consider/include for each row/prediction. This would help us out a huge amount with several models we have in production (e.g. Find the most similar "best fitting" sku size for a user for a given fashion item ----> subset of candidates is N=14 in this case rather than our entire inventory)
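For illustration, the desired "include list" behaviour boils down to a single batched dot product between one query embedding and the embeddings of just the chosen candidates. A minimal NumPy sketch, where the embedding values and the lookup table are made up for the example:

```python
import numpy as np

# Hypothetical candidate embeddings keyed by id (values made up).
candidate_embs = {
    "TopGun": np.array([0.9, 0.1, 0.0]),
    "Star Wars": np.array([0.2, 0.8, 0.1]),
    "The Titanic": np.array([0.1, 0.2, 0.9]),
}
user_emb = np.array([1.0, 0.0, 0.0])  # embedding for user 1021314 (made up)

# Score only the N=3 candidates of interest, in one vectorized operation.
subset = ["TopGun", "Star Wars", "The Titanic"]
emb_matrix = np.stack([candidate_embs[c] for c in subset])  # (N, d)
scores = emb_matrix @ user_emb                              # (N,) dot-product scores
ranked = [subset[i] for i in np.argsort(-scores)]           # best first
```

This is exactly the computation an "include list" argument on the index would perform, minus the vocabulary/index bookkeeping the framework would handle for you.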

I hope this all makes sense!

nialloh23 avatar Jul 30 '21 12:07 nialloh23

I think my answer depends on how large the number of items you want to exclude (num_exclude) is, relative to the total number of items (N).

  1. If num_exclude is much smaller than N, a single index with post-filtering is probably best. Use case: recommend best items for a user with things they have already purchased removed.
  2. If num_exclude is very close to N, you need a different solution. For example, you'd maintain a random access database (Postgres, Redis, etc) that maps product ids to embeddings. To tackle your example of picking the best SKU out of 14, you'd first determine what 14 you are interested in, then look up their embeddings in the database, then do the scoring.

Constructing the database in (2) shouldn't be too error prone - you wouldn't have to handle vocabs etc manually. You could do:

# Embed each item once and store it under its id.
for product_id, product_features in product_data_dataset:
  embedding = item_model(product_features)
  db.insert(product_id, embedding)

maciejkula avatar Aug 06 '21 16:08 maciejkula

Thanks @maciejkula. My use case is more like (2), where the number of items to exclude is very close to N (or, put another way, the number of items to include is small (e.g. 16)). The solution I went with is along the lines of the 2nd option you have outlined.

  • I first manually extracted the candidate & query embeddings from the trained model
  • I created my own index lookup for mapping embeddings & ids
  • I then calculated the scores myself via the dot product of the embeddings (in a batched/vectorized manner)
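The three steps above can be sketched as follows, with hypothetical extracted embeddings and ids standing in for the real model outputs:

```python
import numpy as np

# Step 1-2: extracted candidate embeddings plus my own id -> row-index lookup (values made up).
candidate_ids = ["size_S", "size_M", "size_L"]
candidate_embs = np.array([[0.2, 0.8], [0.5, 0.5], [0.8, 0.2]])  # (N, d)
id_to_row = {cid: i for i, cid in enumerate(candidate_ids)}

# Step 3: batched scoring of a batch of queries against an explicit include list.
def batch_scores(query_embs, include_ids):
    rows = [id_to_row[c] for c in include_ids]
    subset = candidate_embs[rows]   # (n, d) embeddings for the include list
    return query_embs @ subset.T    # (batch, n) dot-product scores

queries = np.array([[1.0, 0.0], [0.0, 1.0]])  # two query embeddings (made up)
scores = batch_scores(queries, ["size_S", "size_L"])
```

All of this is glue code that an `include` counterpart to `exclusions` on the index could absorb.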

The solution works but involves a lot of extra manual code outside of the framework.
I'm not sure if building something like this (i.e. include_styles list) natively into something like the brute force index lookup fits within the remit of the framework but would definitely help us in a number of our use cases.

Thanks again for taking the time to give guidance on this.

nialloh23 avatar Aug 09 '21 14:08 nialloh23

To do (1) it's easiest to build a BruteForce layer, then get the _candidates and _identifiers attributes (if memory serves, please double-check). This will at least ensure that no errors happen at that stage.

One extension we could consider is an easier API for retrieving subsets of embeddings:

index.embeddings(candidate_ids: List[str]) -> tf.Tensor

Would that be useful?
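A toy sketch of what such a method might do internally, using NumPy stand-ins for the index's stored identifiers and candidate embeddings (the class, attribute names, and data here are all assumptions for illustration, not the real TFRS internals):

```python
import numpy as np

class ToyIndex:
    """Toy stand-in for a brute-force index holding candidate ids + embeddings."""
    def __init__(self, identifiers, candidates):
        self._identifiers = list(identifiers)      # candidate ids, in row order
        self._candidates = np.asarray(candidates)  # (N, d) embedding matrix
        self._row = {cid: i for i, cid in enumerate(self._identifiers)}

    def embeddings(self, candidate_ids):
        """Return the embedding rows for the requested subset of candidates."""
        return self._candidates[[self._row[c] for c in candidate_ids]]

index = ToyIndex(["a", "b", "c"], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
subset = index.embeddings(["c", "a"])  # (2, d) rows for "c" and "a"
```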

maciejkula avatar Aug 09 '21 18:08 maciejkula

"It's easiest to build a BruteForce layer"... that's exactly how I've done it. I believe you pointed me in this direction several months ago, and this advice has come in handy several times since :)

I love the idea of an API that allows retrieving a subset of embeddings. That would definitely help streamline things.

nialloh23 avatar Aug 10 '21 19:08 nialloh23

hey @maciejkula -- wondering if there are any updates on this feature! really appreciate your help!

violetdang1 avatar Jan 06 '23 21:01 violetdang1