
[BUG] Target in schema causes incorrect down scoring of false negative.

Open angmc opened this issue 3 years ago • 3 comments

Bug description

A target column in the schema causes false negatives to be down scored incorrectly. When the schema includes a target, positive_item_ids is filled with the values of the target column instead of the actual item ids. In the reproduction below the target is always 1, so every item whose item_id is not 1 is treated as a valid negative.
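To make the effect concrete, here is a minimal pure-Python sketch of false-negative masking with in-batch negatives. It is an illustration only, not the Merlin implementation: the function `false_negative_mask` and the example ids are hypothetical, and it assumes the scorer masks any candidate whose item id equals the row's positive item id.

```python
def false_negative_mask(positive_item_ids, candidate_item_ids):
    """Return a batch x candidates grid of booleans.

    An entry is True when the candidate has the same item id as the row's
    positive item, i.e. it is not a real negative and should be down scored.
    (Hypothetical helper, for illustration only.)
    """
    return [
        [pos == cand for cand in candidate_item_ids]
        for pos in positive_item_ids
    ]


# In-batch negatives: every item in the batch is a candidate for every row.
batch_item_ids = [3, 7, 3, 5]

# Intended behavior: the positives are the batch's own item ids, so the
# duplicate item 3 (rows 0 and 2) is flagged as a false negative.
correct = false_negative_mask(batch_item_ids, batch_item_ids)

# Behavior described in this issue: the positives are replaced by the
# target column, which is 1 for every row, so nothing matches.
buggy = false_negative_mask([1, 1, 1, 1], batch_item_ids)

print(correct[0][2], correct[2][0])          # duplicates are flagged
print(any(any(row) for row in buggy))        # no candidate is flagged
```

With the target values standing in for the positives, the duplicate of item 3 is never masked, which is the incorrect down scoring reported here.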

Steps/Code to reproduce bug

import random
import pandas as pd
import nvtabular as nvt
from nvtabular.ops import AddMetadata, Categorify
from merlin.schema.tags import Tags
from tensorflow.keras.utils import unpack_x_y_sample_weight
import merlin.models.tf.dataset as tf_dataloader
import merlin.models.tf as mm
from merlin.io.dataset import Dataset
import tensorflow as tf
# create dummy data and workflow
columns = ['item_id', 'user_id', 'itemfeat', 'userfeat']
nrows = 4
# one column of random integers per name in `columns`
df = pd.DataFrame({col: [random.randint(0, 10) for _ in range(nrows)] for col in columns}, index=range(nrows))
df['target'] = 1
targets = ["target"] >> AddMetadata(tags=[Tags.BINARY_CLASSIFICATION, Tags.TARGET])
user_id = ['user_id'] >> AddMetadata(tags=[Tags.USER_ID]) >> Categorify()
item_id = ['item_id'] >> AddMetadata(tags=[Tags.ITEM_ID]) >> Categorify()
item_features = ['itemfeat'] >> AddMetadata(tags=[Tags.ITEM]) >> Categorify()
user_features = ['userfeat'] >> AddMetadata(tags=[Tags.USER]) >> Categorify()
outputs = user_id + item_id + item_features + user_features + targets
workflow = nvt.Workflow(outputs)
df = Dataset(df)
workflow.fit(df)
df = workflow.transform(df)

#uncomment to see expected final output
# df.schema = df.schema.remove_col('target')


# define model
model = mm.TwoTowerModel(
    df.schema,
    query_tower=mm.MLPBlock([64, 64], no_activation_last_layer=True),
    item_tower=mm.MLPBlock([64, 64], no_activation_last_layer=True),
)
model.compile(
    optimizer='adam',
    run_eagerly=False,
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
    metrics=[],
)

# get a sample batch to predict on
data = tf_dataloader.BatchedDataset(df, batch_size=5, shuffle=False)
x = next(data)
x, y, sample_weight = unpack_x_y_sample_weight(x)


# get training output
pred = model(x, targets=y, training=True)

positive_item_ids=<tf.Tensor: shape=(4, 1), dtype=int64, numpy= array([[1], [1], [1], [1]])>

Expected behavior

The positive_item_ids should be the same values as the negative_item_ids.

angmc avatar Sep 01 '22 15:09 angmc

@bschifferer, could you confirm whether this is a P0 or a P1? The priority field said P0 and the label was P1.

viswa-nvidia avatar Sep 12 '22 16:09 viswa-nvidia

@sararb to avoid this kind of issue, do you think we could add a check in class ItemRetrievalScorer(Block) that tests whether the schema has a column tagged as target and, if so, generates a warning? Thanks.
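A rough sketch of what such a check could look like is given below. It is only an illustration under stated assumptions: the schema is modeled as a plain dict from column name to a set of tag strings, and `TARGET_TAG`, `find_target_columns` and `warn_if_targets_present` are hypothetical names, not part of Merlin. A real check would read the tags from the Merlin schema object instead.

```python
import warnings

# Stand-in for the target tag; with Merlin this would be Tags.TARGET.
TARGET_TAG = "target"


def find_target_columns(schema_columns):
    """Return the names of all columns tagged as target.

    schema_columns: dict mapping column name -> set of tag strings
    (a simplified stand-in for a real schema object).
    """
    return [name for name, tags in schema_columns.items() if TARGET_TAG in tags]


def warn_if_targets_present(schema_columns):
    """Emit a UserWarning when the schema still contains target columns."""
    targets = find_target_columns(schema_columns)
    if targets:
        warnings.warn(
            f"Schema contains target column(s) {targets}; their values may be "
            "used as positive item ids when scoring negatives.",
            UserWarning,
        )
    return targets


schema_columns = {
    "item_id": {"item_id", "categorical"},
    "user_id": {"user_id", "categorical"},
    "target": {"binary_classification", "target"},
}
found = warn_if_targets_present(schema_columns)  # warns and returns ['target']
```

Returning the offending column names alongside the warning lets the caller decide whether to drop them or stop.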

rnyak avatar Sep 14 '22 14:09 rnyak

The model could also just work on a copy of the schema object and, if a target is in the schema, drop it with schema = schema.remove_col('target').
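As a sketch of that idea, the snippet below drops target-tagged columns from a copy and leaves the caller's schema untouched. The schema is again modeled as a plain dict from column name to a set of tag strings, and `without_target_columns` is a hypothetical helper; with the real schema object the equivalent step is the commented-out `remove_col('target')` line from the reproduction.

```python
import copy


def without_target_columns(schema_columns, target_tag="target"):
    """Return a deep copy of the schema with target-tagged columns removed.

    schema_columns: dict mapping column name -> set of tag strings
    (a simplified stand-in for a real schema object). The input is not
    modified, so the caller can still use the targets elsewhere.
    """
    cleaned = copy.deepcopy(schema_columns)
    target_names = [name for name, tags in cleaned.items() if target_tag in tags]
    for name in target_names:
        del cleaned[name]
    return cleaned


original = {
    "item_id": {"item_id", "categorical"},
    "user_id": {"user_id", "categorical"},
    "target": {"binary_classification", "target"},
}
model_schema = without_target_columns(original)

print(sorted(model_schema))  # columns the model would see
print(sorted(original))      # the caller's schema still has the target
```

Working on a copy matters because the same schema may still be needed with its target column for other models or for evaluation.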

angmc avatar Sep 14 '22 15:09 angmc