
How to match embedding dimensions for different cardinalities of user and item sets?

Open hkristof03 opened this issue 3 years ago • 1 comment

Thanks to the developers and to the community for the library and the excellent questions and answers.

According to this source, a common rule of thumb is to set the embedding dimension to the fourth root of the number of unique categories. Let's say I have a user space with a cardinality of 400k, while the cardinality of the item space is 25k. I would naturally expect the former to require a larger embedding dimension. Applying the rule of thumb, the embedding dimension for the Query tower would be around 25, while for the Candidate tower it would be ~13. Suppose we also add context features to the user and item embeddings, so the final vectors end up with sizes of 40 and 25, respectively.
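For concreteness, here is a quick back-of-the-envelope calculation of that rule of thumb applied to the cardinalities above (this is just the heuristic, not anything the library enforces):

```python
# Fourth-root rule of thumb for embedding dimension (heuristic only).
num_users = 400_000
num_items = 25_000

user_dim = round(num_users ** 0.25)  # ~25
item_dim = round(num_items ** 0.25)  # ~13

print(user_dim, item_dim)  # 25 13
```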

My question is, how do I compute the inner product between these vectors? Should I pad the shorter one with 1s, or is there another solution typically applied in this case?

Another question is related to the size of the embeddings. With the above-mentioned cardinalities, the two embedding matrices would be of sizes 400_000 x 25 and 25_000 x 13. Are there any practical solutions to make the training faster, instead of training such large embeddings? I was thinking about splitting the user space in a stratified way and training a separate model for each user subset.

Also, because the item space is much smaller than the user space, the concerns described in #279 could be more relevant for this use case, as the user-item interactions are not as sparse here as in the case of YouTube user-item interactions, for example.

hkristof03 avatar Jun 14 '22 12:06 hkristof03

Hi @hkristof03 :wave: In the end, your query and candidate embeddings need to be the same size. People often achieve this with dense layer(s) after the feature embeddings. However, if your only input is the user id, then you should just set your user and item embedding matrices to the same dimensionality. You would then hyper-parameter tune to find the optimal dimensionality.
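A minimal sketch of the dense-projection approach in Keras, assuming plain integer ids and an illustrative shared output size of 32 (the layer sizes and ids here are made up, not values from this issue):

```python
import tensorflow as tf

output_dim = 32  # shared dimension, an arbitrary choice for illustration

query_tower = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=400_000, output_dim=25),  # user id embedding
    tf.keras.layers.Dense(output_dim),  # project 25 -> 32
])

candidate_tower = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=25_000, output_dim=13),  # item id embedding
    tf.keras.layers.Dense(output_dim),  # project 13 -> 32
])

# Both towers now output vectors of the same size, so the dot product is defined.
# Context features could be concatenated to the embedding before the Dense projection.
user_vec = query_tower(tf.constant([42]))
item_vec = candidate_tower(tf.constant([7]))
score = tf.reduce_sum(user_vec * item_vec, axis=-1)
```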

To your second question, an embedding matrix with 400K rows is not particularly large, and training on subsets is not a good idea: it would reduce your model's ability to learn from the collective behaviour of all users. One thing you can try is to embed each user based on the average vector of their previous positive item interactions, rather than by using their user id. In this case you have a single embedding matrix of (25000, d) that is used by both your query and candidate tower.
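A rough sketch of that shared-embedding idea, assuming a recent TF 2.x where `Embedding` and `tf.reduce_mean` support ragged tensors; the dimension `d`, the function names and the example histories are illustrative:

```python
import tensorflow as tf

d = 32  # illustrative embedding dimension
# One item embedding matrix of shape (25_000, d), shared by both towers.
shared_item_embedding = tf.keras.layers.Embedding(input_dim=25_000, output_dim=d)

def embed_candidate(item_ids):
    # Candidate tower: look up each item id directly.
    return shared_item_embedding(item_ids)                  # (batch, d)

def embed_query(history_item_ids):
    # Query tower: average the embeddings of the user's past positive items.
    history_vecs = shared_item_embedding(history_item_ids)  # ragged (batch, None, d)
    return tf.reduce_mean(history_vecs, axis=1)             # (batch, d)

# Example with a ragged batch of user histories and a batch of candidate items.
histories = tf.ragged.constant([[3, 17, 250], [99]])
candidates = tf.constant([12, 4])
scores = tf.reduce_sum(embed_query(histories) * embed_candidate(candidates), axis=-1)
```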

I wouldn't worry too much about the problem not being sparse enough at this stage. Even if the average number of positive interactions per user is 200, the sparsity of the problem is still 1 − 200/25,000 ≈ 99.2%.

patrickorlando avatar Jun 22 '22 01:06 patrickorlando