[BUG] model.evaluate() and topk_model.evaluate() do not give same metric (case of no sampled softmax)
Bug description
I am running experiments with sampled softmax, and I use the new top-k encoder method (`to_top_k_encoder`) for evaluation in that case. However, when I am *not* using sampled softmax, I expected the two evaluation methods below to return the same metrics, but they do not.
Steps/Code to reproduce bug
Original way of evaluation:
```python
predict_last = mm.SequenceMaskLast(schema=seq_schema, target=target, transformer=xlnet_block)
eval_results = model_transformer.evaluate(
    valid_ds,
    batch_size=512,
    pre=predict_last,
    return_dict=True,
)
```
New way of evaluation:
```python
target = train_ds_schema.select_by_tag(Tags.ITEM_ID).first
max_k = 10
topk_model = model_transformer.to_top_k_encoder(k=max_k)
topk_model.compile(run_eagerly=False)
loader = mm.Loader(valid_ds, batch_size=512)
eval_results = topk_model.evaluate(loader, return_dict=True, pre=predict_last)
```
Expected behavior
When sampled softmax is not used, both of the above evaluation paths were expected to produce the same metric values.
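For intuition on why the two paths should agree: ranking the full candidate set and then applying a metric cutoff of 10 is mathematically the same as first truncating to the top `max_k` candidates (with `max_k >= 10`) and then applying the cutoff. A minimal NumPy sketch with synthetic scores (not Merlin code, just an illustration of the expected equivalence):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items = 64, 1000
max_k, cutoff = 20, 10          # encoder keeps max_k candidates; metric is recall@cutoff

scores = rng.normal(size=(n_users, n_items))      # full-catalog scores per user
targets = rng.integers(0, n_items, size=n_users)  # one held-out target item per user

def recall_at_k(ranked_ids, targets, k):
    """Fraction of users whose target appears among their first k ranked ids."""
    return float(np.mean([t in row[:k] for t, row in zip(targets, ranked_ids)]))

# "model.evaluate" path: rank the full catalog, then apply the metric cutoff
full_ranking = np.argsort(-scores, axis=1)
r_full = recall_at_k(full_ranking, targets, cutoff)

# "topk_model.evaluate" path: truncate to max_k candidates first, then apply the cutoff
truncated = full_ranking[:, :max_k]
r_topk = recall_at_k(truncated, targets, cutoff)

assert r_full == r_topk  # identical whenever max_k >= cutoff
```

This is why any discrepancy between the two Merlin evaluation paths (absent sampled softmax) looks like a bug rather than an expected approximation.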
Environment details
- Merlin version: 23.05 TF
- Tensorflow version (GPU?): 2.11.0+nv23.2
Additional context
REPRODUCIBLE EXAMPLE (from Ronay):
```python
import os
import itertools

import numpy as np
import tensorflow as tf

import merlin.models.tf as mm
from merlin.dataloader.ops.embeddings import EmbeddingOperator
from merlin.io import Dataset
from merlin.schema import Tags
from merlin.datasets.synthetic import generate_data

sequence_testing_data = generate_data("sequence-testing", num_rows=100)
sequence_testing_data.schema = sequence_testing_data.schema.select_by_tag(
    Tags.SEQUENCE
).select_by_tag(Tags.CATEGORICAL)
seq_schema = sequence_testing_data.schema

item_id_name = seq_schema.select_by_tag(Tags.ITEM).first.properties['domain']['name']
target = sequence_testing_data.schema.select_by_tag(Tags.ITEM_ID).column_names[0]

query_schema = seq_schema
output_schema = seq_schema.select_by_name(target)

d_model = 48
BATCH_SIZE = 32
dmodel = int(os.environ.get("dmodel", '48'))

input_block = mm.InputBlockV2(
    query_schema,
    embeddings=mm.Embeddings(
        seq_schema.select_by_tag(Tags.CATEGORICAL),
        sequence_combiner=None,
        dim=dmodel,
    ),
)

xlnet_block = mm.XLNetBlock(d_model=dmodel, n_head=2, n_layer=2)

def get_output_block(schema, input_block=None):
    candidate_table = input_block["categorical"][item_id_name]
    to_call = candidate_table
    outputs = mm.CategoricalOutput(to_call=to_call)
    return outputs

output_block = get_output_block(seq_schema, input_block=input_block)

projection = mm.MLPBlock(
    [128, output_block.to_call.table.dim],
    no_activation_last_layer=True,
)

session_encoder = mm.Encoder(
    input_block,
    mm.MLPBlock([128, dmodel], no_activation_last_layer=True),
    xlnet_block,
    projection,
)

model = mm.RetrievalModelV2(query=session_encoder, output=output_block)

optimizer = tf.keras.optimizers.Adam(learning_rate=0.005)
loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)

model.compile(
    run_eagerly=False,
    optimizer=optimizer,
    loss=loss,
    metrics=mm.TopKMetricsAggregator.default_metrics(top_ks=[10]),
)

model.fit(
    sequence_testing_data,
    batch_size=32,
    epochs=1,
    pre=mm.SequenceMaskRandom(
        schema=seq_schema, target=target, masking_prob=0.3, transformer=xlnet_block
    ),
)

predict_last = mm.SequenceMaskLast(schema=seq_schema, target=target, transformer=xlnet_block)

model.evaluate(
    sequence_testing_data,
    batch_size=BATCH_SIZE,
    pre=predict_last,
    return_dict=True,
)
```
Once the above has run, please run the following and compare the metric values from model.evaluate() above with those from topk_model.evaluate() below. The results do not match.
```python
loader = mm.Loader(sequence_testing_data, batch_size=BATCH_SIZE)
max_k = 10
topk_model = model.to_top_k_encoder(k=max_k)
topk_model.compile(run_eagerly=False)
metrics = topk_model.evaluate(loader, return_dict=True, pre=predict_last)
metrics
```
Please also note that the metric values change each time topk_model.evaluate() is rerun. I added shuffle=False to the loader, but I still get different metric values on each run.
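One plausible (unconfirmed) source of such run-to-run drift is tie-breaking: top-k selection with argsort-style ranking breaks ties by position, so if tied scores are presented in a different candidate order between runs, the selected top-k set can differ even though the scores are identical. A small NumPy sketch of that effect:

```python
import numpy as np

scores = np.array([0.5, 0.9, 0.5, 0.1])  # items 0 and 2 have tied scores
ids = np.arange(4)

# original candidate order: the tie is broken in favor of the earlier item (0)
top2_a = ids[np.argsort(-scores, kind="stable")][:2]

# same scores, candidates presented in a different order: item 2 now comes first
perm = np.array([2, 1, 0, 3])
top2_b = ids[perm][np.argsort(-scores[perm], kind="stable")][:2]

# the two top-2 sets disagree purely because of tie-breaking order
assert set(top2_a) != set(top2_b)
```

This is only an illustration of one mechanism that could make top-k metrics unstable across reruns; it does not establish that it is the cause of the discrepancy reported here.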