[Implementation Question] Movie/ProductModel containing more than just e.g. 'movie_titles'
The examples usually state:
ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
})
movies = movies.map(lambda x: x["movie_title"])
In my current dataset I do not have movies but products, so I am interested in certain fields (Catalog, Manufacturer, Color, Size, Price). I've managed to create a final ratings DataFrame structured as described below.
The methodology follows the GRU4Rec paper, as mentioned in one of my previous posts (#306), but since this issue is about something different I decided to open a new thread; I refer to it only for context. Basically, we have a Sequence part, which contains the sequences of items, and a Final part, which contains information about the last item.
What I aim to achieve here is something like this:
ratings = ratings.map(lambda x: {
    "SEQUENCE": x["SEQUENCE"],
    "SEQUENCE_CD": x["SEQUENCE_CD"],
    ...,
})
product = tf_products.map(lambda x: {
    "PRODUCT": x["COMBINED_ID"],
    "PRODUCT_CD": x["PRODUCT_CATEG_CD"],
    "PRODUCT_MANUFACTURER": x["PRODUCT_MANUFACTURER"],
    "PRODUCT_COLOR": x["PRODUCT_COLOR"],
    "PRODUCT_SIZES": x["PRODUCT_SIZES"],
    "PRODUCT_PRICE": x["PRODUCT_PRICE"],
})
I gather from the examples that this is possible. Indeed, I've created the Sequence and Product models accordingly and my model fits nicely. My problem is with retrieval, specifically the following line.
index.index(product.batch(100).map(model.product_model), product)
Generally, in the movie example, where only the movie title is mapped, movies becomes a <MapDataset shapes: (), types: tf.string>. Here, however (for now I've only tried including up to PRODUCT_CD), product becomes <MapDataset shapes: {PRODUCT: (), PRODUCT_CD: ()}, types: {PRODUCT: tf.string, PRODUCT_CD: tf.string}>.
This dictionary structure essentially breaks things, as the line mentioned above raises an error (precisely because of the dictionary). Checking the source code of tfrs.layers.factorized_top_k.BruteForce, I realized this is the line causing the issue:
identifiers = tf.concat(list(identifiers), axis=0)
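A workaround I have been sketching (I am not sure it is the intended approach) is to keep the full dictionary only for computing the candidate embeddings and to pass the plain ID tensor as the identifiers, so that BruteForce never has to concatenate a dictionary:

# Sketch only: candidate embeddings come from the product model applied to the
# full feature dictionary, while the identifiers are just the raw product IDs.
index.index(
    product.batch(100).map(model.product_model),     # candidate embeddings
    product.batch(100).map(lambda x: x["PRODUCT"]),  # plain tf.string identifiers
)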
This is where I am somewhat lost. What I want to do is feed the model additional information (Manufacturer/Color/Size/Price). I am also wondering whether placing these features on product rather than on ratings is correct. For example, I've also considered the following scenario:
ratings = tf_ratings.map(lambda x: {
    "SEQUENCE": x["SEQUENCE"],
    "SEQUENCE_CD": x["SEQUENCE_CD"],
    "SEQUENCE_MAN": x["SEQUENCE_MAN"],
    "SEQUENCE_COL": x["SEQUENCE_COL"],
    "SEQUENCE_SIZ": x["SEQUENCE_SIZ"],
    "SEQUENCE_PRI": x["SEQUENCE_PRI"],
    "TIMESTAMP": x["TIMESTAMP"],
    "PRODUCT": x["PRODUCT"],  # <- PRODUCT here is basically FINAL, simply renamed
    "PRODUCT_CD": x["PRODUCT_CD"],
    "PRODUCT_MAN": x["PRODUCT_MAN"],
    "PRODUCT_COL": x["PRODUCT_COL"],
    "PRODUCT_SIZ": x["PRODUCT_SIZ"],
    "PRODUCT_PRI": x["PRODUCT_PRI"],
    "SCORE": x["SCORE"],
})
product = tf_products.map(lambda x: x["COMBINED_ID"])
In the end, however, I would also like to be able to relate categories/catalogs to one another (the _CD fields), whereas at this point I can only relate products to one another (COMBINED_ID). I cannot really wrap my head around that one, so I decided to include it as a bonus question.
My understanding is this: if product contains only COMBINED_ID, then there is only one item embedding, based purely on the product's ID (or every item embedding is derived from the product's ID, which in movie-title terms could mean an additional TextVectorization, as in the examples). Additionally, there are multiple user/sequence embeddings that help the system figure out the similarities between different sequences. In that setup, I fail to associate categories/sizes/colors between items directly; that information is only taken into account on the sequence side.
On the other hand, if I were to include multiple embeddings in the ProductModel, then the manufacturers, available colors and (most importantly) categories would directly associate different items with one another, and I would be able to discern things such as the most similar categories (I would also appreciate an answer on that point if possible).
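For concreteness, here is a rough sketch of what I have in mind for such a multi-feature product tower (unique_product_ids, unique_categories and unique_manufacturers are placeholder vocabularies that would be computed from the data beforehand):

import tensorflow as tf

class ProductModel(tf.keras.Model):
    # Sketch of a candidate tower that combines several product features.
    def __init__(self, unique_product_ids, unique_categories, unique_manufacturers):
        super().__init__()
        self.id_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(vocabulary=unique_product_ids, mask_token=None),
            tf.keras.layers.Embedding(len(unique_product_ids) + 1, 32),
        ])
        self.category_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(vocabulary=unique_categories, mask_token=None),
            tf.keras.layers.Embedding(len(unique_categories) + 1, 8),
        ])
        self.manufacturer_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(vocabulary=unique_manufacturers, mask_token=None),
            tf.keras.layers.Embedding(len(unique_manufacturers) + 1, 8),
        ])
        # Numeric features such as price can be normalized and concatenated;
        # price_normalizer.adapt(raw_prices) would have to be called beforehand.
        self.price_normalizer = tf.keras.layers.Normalization(axis=None)

    def call(self, features):
        # Returns one embedding vector per product, built from all of its features.
        return tf.concat([
            self.id_embedding(features["PRODUCT"]),
            self.category_embedding(features["PRODUCT_CD"]),
            self.manufacturer_embedding(features["PRODUCT_MANUFACTURER"]),
            tf.reshape(self.price_normalizer(features["PRODUCT_PRICE"]), (-1, 1)),
        ], axis=1)

My expectation is that, with a tower like this, items sharing a category or manufacturer would end up closer together in the candidate space, which is what I am after.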
This is how I do it:
index = tfrs.layers.factorized_top_k.BruteForce(self.model.query_model)
index.index_from_dataset(tf.data.Dataset.zip((
    unique_product_id_for_inference.batch(1000),
    products.batch(1000).map(self.model.candidate_model),
)))
There have been some API changes since this issue was first posted. Indexing a Retrieval layer is now achieved with the index_from_dataset method.
This method expects a dataset that contains a tuple of (id, embedding_vector).
To prepare this, you typically have a dataset with all of your item features and an item model that takes a dictionary of features and returns the embedding vector for the item.
We can index the layer like so:
index.index_from_dataset(
    items_dataset.batch(1024).map(lambda x: (x['item_id'], item_model(x)))
)
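Once indexed, the layer can be queried directly. A quick sketch, where query_features_batch is a placeholder for whatever dictionary of query features your query_model expects:

# Sketch: the indexed layer maps the raw query features through query_model and
# returns (scores, item IDs) for the top-k candidates.
scores, item_ids = index(query_features_batch, k=10)
print(item_ids[0])  # top-10 recommended item IDs for the first query in the batch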