[QST] How to use pretrained embeddings as features in DLRM?

Open aabdullah-getguru opened this issue 2 years ago • 11 comments

❓ Questions & Help

I'm a beginner with Merlin Models. I'm setting up a DLRM model, with 3 types of input features:

  1. categorical features
  2. continuous features
  3. pre_trained embeddings for user/item

For simplicity, we can assume we have a data frame with columns user_id, item_id, categorical_1, continuous_1, embeddings_user, embeddings_item.

(1) and (2) are straightforward to add to the architecture via simply using the right tags and nvt.ops. However, I'm not sure how one could add in the embeddings_1. Is the right approach just to define a custom architecture using the merlin provided blocks? I would prefer these embeddings_1 to be trainable if possible.

Or is there a quicker way to use them with DLRM via the right nvtabular ops and tags? Thanks!

aabdullah-getguru avatar Mar 06 '23 13:03 aabdullah-getguru

@aabdullah-getguru

  • Are you planning to feed the embeddings to an embedding layer, or use them as an extra continuous input feature? If the latter, you need to aggregate them (e.g., take the average): you cannot yet feed a list of continuous features, or a list of lists of continuous features, to an MLP model without aggregation (note that DLRM has a bottom MLP for numeric features).

If you want to see how you can customize DLRM building blocks, you can refer to this example: https://github.com/NVIDIA-Merlin/models/blob/main/examples/06-Define-your-own-architecture-with-Merlin-Models.ipynb

  • Or are you planning to feed the embeddings to an embedding layer? If so, please check out this notebook as an example: https://github.com/NVIDIA-Merlin/models/blob/main/examples/usecases/entertainment-with-pretrained-embeddings.ipynb
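For the aggregation route, a minimal sketch (toy data; the column name `embeddings_user` and the mean aggregation are just assumptions, not anything your pipeline requires):

```python
import numpy as np
import pandas as pd

# Hypothetical frame where each row carries a pretrained user embedding.
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "embeddings_user": [np.ones(4), np.full(4, 2.0), np.zeros(4)],
})

# Aggregate each embedding (here: mean) into a single scalar so it can be
# tagged as a plain continuous feature for DLRM's bottom MLP.
df["embeddings_user_mean"] = df["embeddings_user"].apply(np.mean)
```

Any reduction (mean, sum, max) works here; the point is only that the bottom MLP receives scalars, not lists.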

rnyak avatar Mar 06 '23 18:03 rnyak

@rnyak Thank you, that's a very helpful lead. I'll look into feeding the embeddings to the embedding layer.

aabdullah-getguru avatar Mar 10 '23 21:03 aabdullah-getguru

Hi @rnyak,

Thanks for the example. I am trying to use embedding vectors from NLP and CV models. The problem is that these extracted features are available for some items but not for others. I see from the example that if the item_id is missing from the embedding table, the lookup result will be an all-zero vector. But I am trying to find out what to do when, for example:

item_id | text_embedding | image_embedding
      1 | None           | [...]
      2 | [...]          | [...]
      3 | [...]          | None

So item_id = 1 does not have a text embedding vector while it has an image embedding vector, and so on. In this case the id 1 cannot be used to retrieve the text embedding vector. Is the only way to handle this to add full zero vectors to the given embedding tables at certain positions, in this case at index zero in the text embedding table, or is there a better method?
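A minimal sketch of the zero-filling workaround I'm asking about (toy embedding width and values; I'm not claiming this is the best method):

```python
import numpy as np
import pandas as pd

DIM = 4  # toy width; real NLP/CV vectors would be much wider

# Toy table mirroring the layout above: item 1 lacks a text embedding.
items = pd.DataFrame({
    "item_id": [1, 2, 3],
    "text_embedding": [None, np.full(DIM, 0.5), np.full(DIM, 0.7)],
})

# Substitute an all-zero vector wherever the embedding is missing, so every
# item_id still maps to a row at its position in the lookup table.
text_table = np.vstack([
    vec if vec is not None else np.zeros(DIM)
    for vec in items["text_embedding"]
])
```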

Thanks for your help in advance!

hkristof03 avatar Sep 21 '23 16:09 hkristof03

Hi, have you found a way to pass precomputed embeddings to the model? I have a very similar case, and I cannot understand whether it is possible to just use the NVTabular Workflow or other methods to pass both user embeddings and item embeddings. For both I have a 1024-element array associated with each user or item, respectively. I believe this kind of input could greatly help model performance, but there are a lot of memory issues: with ~3M rows this explodes quickly.

The merlin-tensorflow documentation is missing this kind of example; instead, it associates the embeddings with the movieId, and I don't understand why that is necessary.

CarloNicolini avatar May 30 '24 13:05 CarloNicolini

Hi @CarloNicolini, yes, I solved the problem. Just follow this example. If the embedding table is large, you have the option not to move it to the GPU as a whole, only in batches during training (see the 2nd case in the notebook). Keep in mind that the 0th index of the embedding table should be an all-zero vector, which will correspond to unknown IDs. If there are multiple features corresponding to the same embedding table, you can make the embedding table shared for those features with this syntax:

[['feature_x', 'feature_y']] >> nvt.ops. ...

You can verify that the features share the embedding table by checking the schema DataFrame.
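In plain pandas, the effect of grouping the columns is one shared vocabulary built from their union; a toy sketch of the idea (hypothetical column names, not the actual NVTabular implementation):

```python
import pandas as pd

df = pd.DataFrame({
    "feature_x": ["a", "b", "a"],
    "feature_y": ["b", "c", "c"],
})

# One vocabulary over the union of both columns; both features are encoded
# against it, which is what sharing the embedding table amounts to.
vocab = {v: i for i, v in enumerate(sorted(set(df["feature_x"]) | set(df["feature_y"])))}
encoded = df.apply(lambda col: col.map(vocab))
```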

I hope this helps.

hkristof03 avatar Jun 01 '24 11:06 hkristof03

> You can verify that the features share the embedding table by checking the schema DataFrame.

In my case the values remapped by Categorify start from 3 (reading the unique.item_id.parquet file). I don't understand why I should only add a single row of zeros instead of three (corresponding to 0, 1, and 2), as those are the reserved categories for nulls, padding, and out-of-vocabulary values, respectively.

CarloNicolini avatar Jun 10 '24 23:06 CarloNicolini

@CarloNicolini I was also thinking about the same thing after reading this issue. However, the example I shared only adds one row.

@rnyak could you please comment on this?

hkristof03 avatar Jun 11 '24 09:06 hkristof03

I've experimented and thoroughly checked the values using Loader.peek() as in the example. I can confirm that one row vector of zeros is not enough; otherwise the data are not correctly aligned. Since my id categorical variable after workflow.transform starts from the value 3, I had to prepend a np.zeros([3, 1024]) via np.vstack in order for the dataloader to pass the pretrained embeddings to the model correctly. P.S. The value 3 is clearly because I use nvt.ops.Categorify with the default num_buckets option. Your mileage may vary depending on the number of buckets in Categorify, I believe.
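Concretely, the prepending looks like this (toy catalogue size; 1024 is the embedding width from my case, and the random vectors are just placeholders):

```python
import numpy as np

DIM = 1024
NUM_ITEMS = 5  # toy catalogue size

# Pretrained item vectors, one row per item, ordered by encoded id from 3 up.
pretrained = np.random.default_rng(0).normal(size=(NUM_ITEMS, DIM))

# Prepend three zero rows so indices 0, 1, 2 cover the reserved Categorify
# ids and row 3 lines up with the first real item.
table = np.vstack([np.zeros((3, DIM)), pretrained])
```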

CarloNicolini avatar Jun 24 '24 14:06 CarloNicolini

@CarloNicolini we did not test the pretrained embedding features with the num_buckets option, so it is hard to say whether it would work out of the box. I'd recommend using this functionality by applying the Categorify op without any bucketing or frequency thresholding. Without bucketing you have a 1-to-1 mapping between the transformed and original item-ids (or whatever categorical column you apply the Categorify op to).

> Since my id categorical variable after workflow.transform starts from the value 3

If you apply the Categorify op to a categorical column, we allocate 0 for padding, nulls are mapped to 1, and OOVs are mapped to 2. Then we start the encoding of the most frequent item-id category from 3. You should have unique.item_id parquet files inside the categories folder, from which you can do the reverse mapping.
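A toy sketch of that reverse mapping (the DataFrame below is a stand-in for the real unique.item_id.parquet, whose exact layout can vary across NVTabular versions):

```python
import pandas as pd

# Stand-in for categories/unique.item_id.parquet: the row position is the
# encoded id; rows 0-2 are the reserved padding/null/OOV slots, so the most
# frequent original item lands at encoded id 3.
unique = pd.DataFrame({"item_id": [None, None, None, "item_42", "item_7"]})

# Reverse mapping: encoded id -> original item id.
decode = unique["item_id"].to_dict()
```

With the real file you would load it via pd.read_parquet and build the same dict.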

rnyak avatar Jun 28 '24 13:06 rnyak

Thanks for your feedback! With freq_threshold I can confirm that the results seem to map correctly: I've manually tested some tens of indices and verified that the values map to the ones I expect. As for num_buckets, I did not check; those were only hypotheses.

By the way, this kind of operation strongly calls for an nvt.Workflow.inverse_transform method. That would be fantastically useful for performing certain back-mappings.

CarloNicolini avatar Jun 28 '24 17:06 CarloNicolini

@CarloNicolini thanks. Currently we do not have the bandwidth to add extra features to the library. If you are interested, feel free to open a PR.

rnyak avatar Jun 28 '24 18:06 rnyak