[RMP] Support pre-trained vector embeddings as features

Open karlhigley opened this issue 3 years ago • 10 comments

Problem:

Customers need a way to load embeddings that have been pretrained or trained from separate models into the model. See https://github.com/NVIDIA-Merlin/Merlin/issues/471

Goal:

Enable dataloading of separate embedding tables without having to add these embeddings to the interaction data during training. For serving, those embeddings need to be provided in the request to the model.

Constraints:

  • [ ] External embedding tables may not fit on GPU.
  • [ ] Non-trainable embeddings
  • [ ] Fits in CPU memory; support for tables larger than CPU memory is left as potential future work
  • [ ] Not generating the embedding on the fly (future work)

Starting Point:

  • [ ] JoinExternal Op
  • [ ] Tensor in UVM?
  • [ ] Merlin KV?

Supporting pre-trained vector embeddings as features would provide baseline support for multi-modal use cases that rely on outside models to generate image/text embeddings.

Dataloader

  • [ ] Add lookup for embeddings based on key during dataloading (see the sketch after this list)
  • [ ] Add pretrained embeddings to the dictionary of tensors
  • [ ] Support aggregation functions used to combine in-model item embeddings?
  • [ ] Support for 3D tensors (fixed sequence length) to support session-based models in TF4Rec and MM
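
For illustration, here is a minimal numpy sketch of what that lookup-and-add step amounts to conceptually, independent of the actual dataloader implementation; the table, column names, and helper function are hypothetical.

import numpy as np

# Hypothetical pre-trained table: row i holds the embedding for item id i.
pretrained_table = np.random.rand(1000, 16).astype("float32")

def add_pretrained_embeddings(batch, table, lookup_key, output_key):
    # Look up embedding rows by the integer ids in batch[lookup_key]
    # and add them to the batch dictionary under output_key.
    ids = batch[lookup_key]            # shape: (batch_size,)
    batch[output_key] = table[ids]     # shape: (batch_size, emb_dim)
    return batch

batch = {"item_id": np.array([10, 42, 7])}
batch = add_pretrained_embeddings(batch, pretrained_table, "item_id", "item_id_embedding")
print(batch["item_id_embedding"].shape)  # (3, 16)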

Transformers4Rec & Models

  • [ ] Evaluate the transforms that happen to an embedding after it's pulled from the embedding table so that pretrained embeddings can be processed in the same way.
  • [ ] [Support pre-trained embeddings and create transforms for processing pretrained embeddings in T4R](https://github.com/NVIDIA-Merlin/Transformers4Rec/issues/485)

Merlin Systems

  • [ ] Add embeddings to the recommendation request (from a feature store, KV store, etc.?); a rough illustration follows below
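
As a rough illustration only (not the Merlin Systems API), supplying a pre-computed embedding as an input tensor of a Triton inference request could look like the sketch below; the server URL, model name, and tensor names are hypothetical.

import numpy as np
import tritonclient.grpc as grpcclient  # assumes a running Triton server

# Hypothetical pre-computed item embedding supplied with the request.
item_embedding = np.random.rand(1, 64).astype(np.float32)

client = grpcclient.InferenceServerClient(url="localhost:8001")

inputs = [grpcclient.InferInput("item_embedding", list(item_embedding.shape), "FP32")]
inputs[0].set_data_from_numpy(item_embedding)

response = client.infer(model_name="recsys_ensemble", inputs=inputs)
scores = response.as_numpy("output_1")  # hypothetical output tensor name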

Examples

  • [ ] Example of dataloading pretrained embeddings in Transformers4Rec
  • [ ] Example of dataloading pretrained embeddings in Merlin Models

karlhigley avatar Apr 14 '22 15:04 karlhigley

Ok, this issue now makes much more sense to me 🙂 I created a PR, NVIDIA-Merlin/models#508, but I think it's just a tiny step toward this. I'm not sure what the logical next step here would be.

I certainly need to continue bringing myself up to speed with Merlin Models; I still have only a very narrow understanding of all the components and how they fit together. Regardless, I wonder what the next steps on this could be? @karlhigley, if you could offer a suggestion, that would be greatly appreciated 🙂 This is my first run-in with an RMP issue.

radekosmulski avatar Jun 13 '22 04:06 radekosmulski

I'm honestly not entirely sure either! I captured this issue because I heard you were already working on it, but it's mostly a placeholder for a discussion on the scope of what we'd want to do and where that falls in terms of our team priorities. I don't think we've had that conversation yet, and I'm not entirely sure how/where it would happen either (given time zones etc.)

karlhigley avatar Jun 13 '22 23:06 karlhigley

I put your face on it less to signal that you're responsible for the whole thing (I don't think you are), and more to signal that you'd be the person who is already doing relevant work and probably would have worthwhile thoughts about what we ought to be able to do with pre-trained embeddings.

karlhigley avatar Jun 13 '22 23:06 karlhigley

Thank you very much @karlhigley for these thoughts, they are very helpful! 🙂 Makes a lot of sense.

Just wanted to reference NVIDIA-Merlin/models#508 -- we now have a use case for using pretrained embeddings, but I believe we don't have a good way of freezing them. It would be very good to have this option, as it is likely what most users would want.

radekosmulski avatar Jun 14 '22 01:06 radekosmulski

@EvenOldridge @karlhigley we now have an example for using pre-trained embeddings in MMs, and have a way of freezing them. fyi.

rnyak avatar Aug 17 '22 16:08 rnyak

https://github.com/NVIDIA-Merlin/Merlin/issues/471 has details on the customer request side.

EvenOldridge avatar Aug 17 '22 16:08 EvenOldridge

> NVIDIA-Merlin/Merlin#471 has details on the customer request side.

@EvenOldridge yes we need this for TF4Rec. And I created this ticket https://github.com/NVIDIA-Merlin/Transformers4Rec/issues/475 for that.

rnyak avatar Aug 18 '22 17:08 rnyak

@EvenOldridge If I'm understanding correctly, it sounds like the underlying customer request involves the dataloaders, the T4R library itself, and Merlin Systems (but not NVT). Would it make sense to scope this issue more tightly to the customer request and punt additional features to a subsequent issue?

karlhigley avatar Sep 02 '22 21:09 karlhigley

It also sounds like the customer request necessarily involves having PyTorch serving for T4R worked out. Assuming that the (known-to-be-slow) Python serving isn't sufficient, sounds like we'll need to work out the issues with Torchscript serving.

karlhigley avatar Sep 02 '22 21:09 karlhigley

To the best of my knowledge, TensorFlow has a warm-start mechanism that provides similar functionality. I think they have a meaningful design; maybe we can take inspiration from it: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/warm_starting_util.py#L419 I know some end users are using these APIs for pre-training, and the regular-expression support gives users more convenience.
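
For reference, a minimal sketch of that warm-start mechanism through the public tf.estimator wrapper; the checkpoint path and the variable-name regex below are made up.

import tensorflow as tf

# Warm-start only the embedding variables from a previously trained checkpoint,
# selected via a regular expression over variable names.
ws = tf.estimator.WarmStartSettings(
    ckpt_to_initialize_from="/path/to/previous/checkpoint",  # hypothetical path
    vars_to_warm_start=".*input_layer/.*_embedding.*",       # regex over variable names
)

# The settings are passed to an Estimator, which copies the matching variables
# from the checkpoint before training begins, e.g.:
# estimator = tf.estimator.DNNClassifier(..., warm_start_from=ws)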

rhdong avatar Sep 13 '22 00:09 rhdong

ToDo: How to integrate pre-trained embeddings into the schema file (tagging) so they can be used in the architecture definition

bschifferer avatar Oct 17 '22 21:10 bschifferer

> How to integrate pre-trained embedding in schema file (tagging)

Adding Tags.EMBEDDING as a "prefab" tag in the Merlin Core schema implementation seems like it could make sense 👍🏻
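
A minimal sketch of what that tagging could look like on a dataset schema, assuming an EMBEDDING member exists (or gets added) on the Tags enum; the column name is taken from the example later in this thread.

import cudf
from merlin.io import Dataset
from merlin.schema import Tags

df = cudf.DataFrame({"emb_id_1": [10, 12, 11]})
dataset = Dataset(df)

# Mark the column as a key for a pre-trained embedding.
schema = dataset.schema
schema["emb_id_1"] = schema["emb_id_1"].with_tags([Tags.CATEGORICAL, Tags.EMBEDDING])
dataset.schema = schema

print(schema["emb_id_1"].tags)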

karlhigley avatar Oct 20 '22 14:10 karlhigley

When the embedding tables are not huge and fit in GPU memory, the new PretrainedEmbeddingsInitializer (https://github.com/NVIDIA-Merlin/Transformers4Rec/pull/572) can be used to initialize the embedding matrix with pre-trained embeddings and set them to trainable or not.
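
The T4R initializer in that PR wraps the same idea; purely as an illustration of the underlying mechanism (plain PyTorch, not the T4R API), a frozen pre-trained embedding table looks like this:

import numpy as np
import torch
import torch.nn as nn

# Pre-trained weights produced elsewhere, e.g. by an image or text model.
pretrained = np.random.rand(1000, 64).astype("float32")

# freeze=True keeps the weights fixed (no gradient updates during training);
# freeze=False would make them trainable, mirroring the "trainable or not" option.
embedding = nn.Embedding.from_pretrained(torch.from_numpy(pretrained), freeze=True)

ids = torch.tensor([3, 17, 42])
vectors = embedding(ids)        # shape: (3, 64)
print(vectors.requires_grad)    # False, because the table is frozen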

gabrielspmoreira avatar Dec 08 '22 16:12 gabrielspmoreira

I am not sure whether the main ticket is up to date. In some meetings we say that the feature is almost done, but there are many tickets that are not checked off (finished). I looked into the pre-trained embedding functionality of the dataloader and tried to provide a simple example as a minimal definition of done. That doesn't mean this simple example represents the definition of done; it is just how I imagine using this feature.

I only looked at the TensorFlow side and haven't tested the PyTorch side (assuming it works the same).

Open ToDos (from my point of view):

  • BUG: Getting a key error when combining a target with pre-trained embeddings: KeyError: 'target'
  • BUG: Sequence features are not embedded correctly
  • FEATURE: Convert input columns to embedding ids ( nvt.ops.LambdaOp(lambda x: x.map(emb1_map)) ) - this is similar to a request we have for the GTC recommender. I am not sure whether we want to do this in NVTabular or apply this mapping in the dataloader
  • FEATURE: Merlin Models needs to use the pre-trained embeddings in the model architecture and use them for training. This should work for ranking models, retrieval models, and session-based models. (For special architectures, such as DLRM, it should throw a meaningful error if the pre-trained embeddings do not fit)
  • FEATURE: Transformers4Rec needs to use the pre-trained embeddings in the model architecture and use them for training.
  • ~~FEATURE: The schema object needs to represent the pre-trained embedding functionality, so that MM and Transformers4Rec know that a feature is a pre-trained embedding (more below)~~ -> already exists via dataloader.output_schema
  • FEATURE: (Not sure whether it already exists) provide the embeddings during serving

I will explain my assumptions and the proposed open ToDos in more detail:

  1. My assumption is that the user has a downstream process to generate embeddings (np_emb1 and np_emb2). I am not sure whether we can assume that the ids in the dataset match the order of the numpy arrays; I assume there will be mapping tables to convert them (emb1_map and emb2_map). Either in NVT or in the dataloader, we should provide the functionality to map the input data to the ids of the pre-trained embeddings.
  2. ~~MM and Transformers4Rec define the neural network architecture and rely on the schema object. Setting pre-trained embeddings in the dataloader as transforms currently does not modify the schema object, so MM and Transformers4Rec cannot know that they should expect pre-trained embeddings. We need to modify the schema object to make this change visible. PROPOSAL (see code comments): we add the information to the schema object (e.g. schema['emb_id_1'].add(PreTrain(np_emb1, lookup_key='emb_id_1', embedding_name='emb_id_1'))). It would be great if we did not need to repeat the information in the dataloader (however, we cannot store the numpy object in the schema, so I guess we at least need to provide the numpy object to the dataloader).~~

BUGs:

  • If you uncomment #>> nvt.ops.AddMetadata(tags=[Tags.BINARY_CLASSIFICATION, Tags.TARGET]), next(iter(data_loader)) will fail
import os

os.environ["CUDA_VISIBLE_DEVICES"]="1"

import glob

from merlin.io import Dataset
from merlin.loader.tensorflow import Loader
from merlin.schema import Tags

import numpy as np
import pandas as pd

import nvtabular as nvt
import merlin.models.tf as mm

import cudf

from merlin.dataloader.ops.embeddings import (  # noqa
    EmbeddingOperator,
    MmapNumpyEmbedding,
    NumpyEmbeddingOperator,
)

### Input
np_emb1 = np.random.rand(1000,10)
np_emb2 = np.random.rand(1000,20)
emb1_map = {
    10: 0,
    11: 1,
    12: 2,
    13: 3
}
emb2_map = {
    'a': 0,
    'b': 1,
    'c': 2,
    'd': 3
}
df = cudf.DataFrame({
    'emb_id_1': [10, 12, 11, 12, 11, 13],
    'emb_id_2': ['a', 'd', 'c', 'a', 'd', 'b'],
    'cat1': [1,5,6,3,5,7],
    'cat2': ['a', 'a', 'd', 'e', 'f', 'g'],
    'target': [0,1,1,0,1,0]
})

# NVTabular Workflow
emb1 = ['emb_id_1'] >> nvt.ops.LambdaOp(lambda x: x.map(emb1_map)) >> nvt.ops.AddTags([Tags.CATEGORICAL])
emb2 = ['emb_id_2'] >> nvt.ops.LambdaOp(lambda x: x.map(emb2_map)) >> nvt.ops.AddTags([Tags.CATEGORICAL])
cats = ['cat1', 'cat2'] >> nvt.ops.Categorify()
target = ['target'] #>> nvt.ops.AddMetadata(tags=[Tags.BINARY_CLASSIFICATION, Tags.TARGET])

features = emb1+emb2+cats+target
workflow = nvt.Workflow(features)

ds = Dataset(df)
workflow.fit(ds)
ds_transformed = workflow.transform(ds)
ds_transformed.compute()

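# Dataloader with transforms that look up rows of np_emb1 / np_emb2 by the id columns and add them to the output batch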
data_loader = Loader(
    ds_transformed,
    batch_size=2,
    transforms=[
        NumpyEmbeddingOperator(
            np_emb1,
            lookup_key='emb_id_1',
            embedding_name='emb_id_1'
        ), 
        NumpyEmbeddingOperator(
            np_emb2, 
            lookup_key='emb_id_2',
            embedding_name='emb_id_2'
        )
    ],
    shuffle=False,
)
next(iter(data_loader))
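
# Build a simple MLP model from the dataloader's output schema, which includes the pre-trained embedding columns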
model = mm.Model.from_block(
    mm.MLPBlock([64, 32]),
    data_loader.output_schema, 
    prediction_tasks=mm.BinaryOutput('target')
)
model.compile()
model.fit(data_loader)

Session-based bug: I do not know whether session-based models are in scope (given that Transformers4Rec is mentioned, I guess yes?). Although there are only 2 examples in the batch, the emb tensor is [6, 10] - it does not keep the sequential structure. I do not know what the representation should be, but I think we might need to convert it to __values and __offsets (and the offsets are missing)? (A small sketch of that values/offsets layout follows the code below.)

emb = np.random.rand(1000,10)
df = cudf.DataFrame({
    'idx': [0,1,2,3,4,5,6,7,8,9],
    'id1': [[0, 1], [1,2,3,4],[2],[3],[4],[5],[6],[8],[9],[10]]
})

dataset = Dataset(df)
schema = dataset.schema
for col_name in ['id1']:
    schema[col_name] = schema[col_name].with_tags(Tags.CATEGORICAL)
dataset.schema = schema
embeddings_np = emb
data_loader = Loader(
    dataset,
    batch_size=2,
    transforms=[NumpyEmbeddingOperator(
        embeddings_np, 
        lookup_key='id1',
        embedding_name='emb'
    )],
    shuffle=False,
)
next(iter(data_loader))
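
For context, a minimal numpy sketch of the __values/__offsets layout referred to above (how a ragged list column is typically flattened), independent of the dataloader internals:

import numpy as np

# Two ragged sequences, as in the first batch of the example above.
sequences = [[0, 1], [1, 2, 3, 4]]

# Flatten into a single values array plus row offsets into it.
id1__values = np.concatenate([np.asarray(s) for s in sequences])    # [0 1 1 2 3 4]
id1__offsets = np.cumsum([0] + [len(s) for s in sequences])         # [0 2 6]

# Row i can be recovered as values[offsets[i]:offsets[i + 1]].
row_1 = id1__values[id1__offsets[1]:id1__offsets[2]]                # [1 2 3 4]
print(row_1)

# Looking up embeddings per value yields a (6, 10) tensor; the offsets are what
# preserve the per-example sequence boundaries.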

bschifferer avatar Apr 19 '23 12:04 bschifferer

@sararb to update this ticket

viswa-nvidia avatar May 02 '23 17:05 viswa-nvidia