[BUG] Evaluation scores of topk_encoder.evaluate(...) are inconsistent
Bug description
I observed inconsistent evaluation metrics when running the integration tests with the new API: the first call to topk_encoder.evaluate() seems to carry some internal state in the metrics, which makes its evaluation score much higher than the subsequent calls and the manually computed Recall@100.
1 - EVALUATION METRICS [1st call]: 0.3419625461101532
2 - EVALUATION METRICS [2nd call]: 0.039509184658527374
3 - EVALUATION METRICS [3rd call]: 0.039509184658527374
4 - MANUAL TOP-K PREDICTION - RECALL@100 = 0.03953236607142857
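For context, Keras metric objects are stateful: they accumulate across update_state() calls until reset_state() is invoked, which is the kind of carry-over suspected here. A minimal, self-contained sketch of that behavior (plain tf.keras, not Merlin-specific, shown only for illustration):

import tensorflow as tf

# Stateful Keras metric: results accumulate until reset_state() is called
# (reset_states() on older TF versions).
metric = tf.keras.metrics.Mean()

metric.update_state([1.0, 1.0])   # pretend this is leftover state from a previous run
print(metric.result().numpy())    # 1.0

metric.update_state([0.0, 0.0])   # new evaluation data
print(metric.result().numpy())    # 0.5 -> contaminated by the earlier state

metric.reset_state()
metric.update_state([0.0, 0.0])
print(metric.result().numpy())    # 0.0 -> clean result after resetting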
Steps/Code to reproduce bug
- Get the data from this drive
- Pull the code in PR #790
- Add these lines to the integration test (here)
# """
# ########### 3 - Evaluation score from top-k encoder - 2nd call ############
# """
# Evaluate on valid set
eval_loader = mm.Loader(
self.eval_ds,
batch_size=self.eval_batch_size,
transform=mm.ToTarget(self.eval_ds.schema, item_id_name),
shuffle=False,
)
eval_metrics = recommender.evaluate(
eval_loader,
batch_size=self.eval_batch_size,
return_dict=True,
callbacks=self.callbacks,
)
print("3 - EVALUATION METRICS: ", eval_metrics["recall_at_100"])
# """
# ########### 4 - MANUALLY COMPUTING TOP-K PREDICTIONS ############
from merlin.models.tf.utils import tf_utils
def numpy_recall(labels, top_item_ids, k):
return np.equal(np.expand_dims(labels, -1), top_item_ids[:, :k]).max(axis=-1).mean()
eval_loader = mm.Loader(self.eval_ds, batch_size=self.eval_batch_size, shuffle=False)
item_embeddings = self.model.candidate_embeddings(
item_dataset, index=Tags.ITEM_ID, batch_size=4096
)
item_embeddings = item_embeddings.to_ddf().compute()
values = tf_utils.df_to_tensor(item_embeddings)
ids = tf_utils.df_to_tensor(item_embeddings.index)
recall_at_100_list = []
for batch, target in eval_loader:
batch_item_tower_embeddings = self.model.candidate_encoder(batch)
batch_query_tower_embeddings = self.model.query_encoder(batch)
positive_scores = tf.reduce_sum(
tf.multiply(batch_item_tower_embeddings, batch_query_tower_embeddings), axis=-1
)
batch_user_scores_all_items = tf.matmul(
batch_query_tower_embeddings, values, transpose_b=True
)
top_scores, top_indices = tf.math.top_k(batch_user_scores_all_items, k=100)
top_ids = tf.squeeze(tf.gather(ids, top_indices))
batch_pos_item_id = tf.squeeze(batch["track_id"])
recall_at_100 = numpy_recall(batch_pos_item_id, top_ids, k=100)
recall_at_100_list.append(recall_at_100)
print(f"4 - MANUAL TOP-K PREDICTION - RECALL@100 = {np.mean(recall_at_100_list)}")
# """
Expected behavior
Getting consistent scores when calling topk_encoder.evaluate()
Shall we create a bug ticket on the tf.keras repo? We need to create a repro experiment first.
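A possible skeleton for such a repro experiment, using only plain tf.keras (the model, data, and metric here are illustrative, not taken from the Merlin test):

import numpy as np
import tensorflow as tf

# Tiny untrained model with a stateful metric, to check whether plain Keras also
# reports a different score on the first evaluate() call than on subsequent ones.
x = np.random.rand(256, 8).astype("float32")
y = (np.random.rand(256, 1) > 0.5).astype("float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(loss="binary_crossentropy", metrics=[tf.keras.metrics.AUC(name="auc")])

first = model.evaluate(x, y, return_dict=True, verbose=0)
second = model.evaluate(x, y, return_dict=True, verbose=0)
print(first["auc"], second["auc"])  # should be identical if metric state is handled correctly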
fixed by #830