[BUG] Target in schema causes incorrect down-scoring of false negatives.
Bug description
A target column in the schema causes false negatives to be down-scored incorrectly. When the target is in the schema, the positive item ids are all set to the values of the target column rather than to the item ids. In the reproduction below the target is always 1, so every item whose item_id is not 1 is treated as a valid negative.
Steps/Code to reproduce bug
import random
import pandas as pd
import nvtabular as nvt
from nvtabular.ops import AddMetadata, Categorify
from merlin.schema.tags import Tags
from tensorflow.keras.utils import unpack_x_y_sample_weight
import merlin.models.tf.dataset as tf_dataloader
import merlin.models.tf as mm
from merlin.io.dataset import Dataset
import tensorflow as tf
# Create dummy data and workflow
columns = ['item_id', 'user_id', 'itemfeat', 'userfeat']
nrows = 4
# Build one random integer column per name (iterate over the column names, not over the row count)
df = pd.DataFrame({col: [random.randint(0, 10) for _ in range(nrows)] for col in columns})
# Every row is a positive interaction
df['target'] = 1
targets = ['target'] >> AddMetadata(tags=[Tags.BINARY_CLASSIFICATION, Tags.TARGET])
user_id = ['user_id'] >> AddMetadata(tags=[Tags.USER_ID]) >> Categorify()
item_id = ['item_id'] >> AddMetadata(tags=[Tags.ITEM_ID]) >> Categorify()
item_features = ['itemfeat'] >> AddMetadata(tags=[Tags.ITEM]) >> Categorify()
user_features = ['userfeat'] >> AddMetadata(tags=[Tags.USER]) >> Categorify()
outputs = user_id + item_id + item_features + user_features + targets
workflow = nvt.Workflow(outputs)
df = Dataset(df)
workflow.fit(df)
df = workflow.transform(df)
# Uncomment to see the expected final output
# df.schema = df.schema.remove_col('target')
# Define the model
model = mm.TwoTowerModel(
    df.schema,
    query_tower=mm.MLPBlock([64, 64], no_activation_last_layer=True),
    item_tower=mm.MLPBlock([64, 64], no_activation_last_layer=True),
)
model.compile(
    optimizer='adam',
    run_eagerly=False,
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
    metrics=[],
)
# Get a sample batch to predict on
data = tf_dataloader.BatchedDataset(df, batch_size=5, shuffle=False)
x = next(data)
x, y, sample_weight = unpack_x_y_sample_weight(x)
# Get the training output
pred = model(x, targets=y, training=True)
positive_item_ids=<tf.Tensor: shape=(4, 1), dtype=int64, numpy= array([[1], [1], [1], [1]])>
Expected behavior
The positive_item_ids should be the same values as the negative_item_ids.
@bschifferer, could you confirm whether this is a P0 or a P1? The priority field says P0 but the label says P1.
@sararb To avoid this issue, do you think we could add a check in class ItemRetrievalScorer(Block) that verifies whether the schema has a column tagged as target and, if so, raises a warning? Thanks.
The model could also just work on a copy of the schema object and, if a target column is in the schema, drop it with schema = schema.remove_col('target').
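A rough sketch of how the two suggestions above (warn, then drop the target from a copy) might fit together. To keep it runnable without Merlin installed, the schema here is a plain dict mapping column names to tag strings; that dict, the tag string "target", and the helper name strip_target_columns are all assumptions for illustration and not the Merlin API. A real fix would go through the actual Schema object instead.

```python
import warnings
from typing import Dict, Set

# Hypothetical stand-in for a schema: column name -> set of tag strings.
# This is NOT merlin.schema.Schema; it only illustrates the control flow.
SimpleSchema = Dict[str, Set[str]]


def strip_target_columns(schema: SimpleSchema, target_tag: str = "target") -> SimpleSchema:
    """Return a copy of `schema` without target-tagged columns.

    Emits a UserWarning when at least one such column is found,
    and never mutates the schema that was passed in.
    """
    target_cols = [name for name, tags in schema.items() if target_tag in tags]
    if target_cols:
        warnings.warn(
            f"Schema contains target column(s) {target_cols}; "
            "they are ignored for retrieval scoring.",
            UserWarning,
        )
    # Build a fresh dict (and fresh tag sets) so the caller's schema is untouched.
    return {name: set(tags) for name, tags in schema.items() if name not in target_cols}


# Example schema mirroring the columns tagged in the reproduction above.
schema = {
    "user_id": {"user_id", "categorical"},
    "item_id": {"item_id", "categorical"},
    "target": {"binary_classification", "target"},
}

# Capture warnings so the example can show that exactly one was raised.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cleaned = strip_target_columns(schema)

print(sorted(cleaned))
print(len(caught))
```

Returning a copy rather than editing the schema in place means the user's schema still contains the target afterwards, which matters if the same schema is reused for a ranking model.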