
[BUG] Evaluation scores of topk_encoder.evaluate(...) are inconsistent

Open · sararb opened this issue 3 years ago · 1 comment

Bug description

I observed inconsistent evaluation metrics when running the integration tests with the new API: the first call to topk_encoder.evaluate() picks up internal state left in the metrics, which makes its evaluation score much higher than the scores from subsequent calls.

1 - EVALUATION METRICS [1st call]:  0.3419625461101532

2 - EVALUATION METRICS [2nd call]:  0.039509184658527374

3 - EVALUATION METRICS [3rd call]:  0.039509184658527374

4 - MANUAL TOP-K PREDICTION - RECALL@100 = 0.03953236607142857
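
The pattern above (a much higher first call, then stable subsequent calls) is consistent with stateful Keras metrics whose accumulators are not reset before evaluation. Below is a minimal sketch of that general failure mode using a standalone tf.keras.metrics.Mean; it is not the Merlin top-k metric itself, just an illustration of how leftover state inflates the first result:

    import tensorflow as tf

    # Standalone illustration of stateful-metric leakage (not Merlin code):
    # a Keras metric accumulates across update_state() calls until
    # reset_state() is called, so evaluating without a reset mixes leftover
    # state into the first score.
    metric = tf.keras.metrics.Mean()

    metric.update_state([10.0, 10.0])   # pretend this is leftover state from training
    metric.update_state([1.0, 1.0])     # first "evaluation" without a reset
    print(metric.result().numpy())      # 5.5, contaminated by the leftover state

    metric.reset_state()
    metric.update_state([1.0, 1.0])     # after reset_state() the score is stable
    print(metric.result().numpy())      # 1.0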

Steps/Code to reproduce bug

  1. Get the data from this drive

  2. Pull the code in PR #790

  3. Add these lines to the integration test (here); a toy check of the numpy_recall helper defined in the snippet follows it

            # NOTE: assumes the surrounding integration test already provides
            # np, tf, mm (merlin.models.tf), Tags, item_id_name, recommender,
            # item_dataset, self.eval_ds, self.eval_batch_size, self.model and
            # self.callbacks.
            # ########### 3 - Evaluation score from top-k encoder - 2nd call ############
            # Evaluate on the validation set
            eval_loader = mm.Loader(
                self.eval_ds,
                batch_size=self.eval_batch_size,
                transform=mm.ToTarget(self.eval_ds.schema, item_id_name),
                shuffle=False,
            )
            eval_metrics = recommender.evaluate(
                eval_loader,
                batch_size=self.eval_batch_size,
                return_dict=True,
                callbacks=self.callbacks,
            )
            print("3 - EVALUATION METRICS: ", eval_metrics["recall_at_100"])


            # ########### 4 - MANUALLY COMPUTING TOP-K PREDICTIONS ############
            from merlin.models.tf.utils import tf_utils

            def numpy_recall(labels, top_item_ids, k):
                return np.equal(np.expand_dims(labels, -1), top_item_ids[:, :k]).max(axis=-1).mean()

            eval_loader = mm.Loader(self.eval_ds, batch_size=self.eval_batch_size, shuffle=False)
            item_embeddings = self.model.candidate_embeddings(
                item_dataset, index=Tags.ITEM_ID, batch_size=4096
            )
            item_embeddings = item_embeddings.to_ddf().compute()
            values = tf_utils.df_to_tensor(item_embeddings)
            ids = tf_utils.df_to_tensor(item_embeddings.index)

            recall_at_100_list = []
            for batch, target in eval_loader:
                batch_item_tower_embeddings = self.model.candidate_encoder(batch)
                batch_query_tower_embeddings = self.model.query_encoder(batch)
                positive_scores = tf.reduce_sum(
                    tf.multiply(batch_item_tower_embeddings, batch_query_tower_embeddings), axis=-1
                )

                batch_user_scores_all_items = tf.matmul(
                    batch_query_tower_embeddings, values, transpose_b=True
                )
                top_scores, top_indices = tf.math.top_k(batch_user_scores_all_items, k=100)
                top_ids = tf.squeeze(tf.gather(ids, top_indices))

                batch_pos_item_id = tf.squeeze(batch["track_id"])
                recall_at_100 = numpy_recall(batch_pos_item_id, top_ids, k=100)
                recall_at_100_list.append(recall_at_100)

            print(f"4 - MANUAL TOP-K PREDICTION - RECALL@100 = {np.mean(recall_at_100_list)}")
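
As a sanity check (not part of the original test), the numpy_recall helper defined in the snippet behaves as expected on toy data:

    import numpy as np

    # Toy check of numpy_recall: 2 of the 3 labels appear in the top-2
    # candidate ids of their row, so recall@2 should be 2/3.
    labels = np.array([5, 7, 9])
    top_item_ids = np.array([
        [5, 1, 2],   # label 5 found at rank 1
        [3, 7, 4],   # label 7 found at rank 2
        [1, 2, 9],   # label 9 only at rank 3, missed for k=2
    ])
    assert np.isclose(numpy_recall(labels, top_item_ids, k=2), 2 / 3)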

Expected behavior

Repeated calls to topk_encoder.evaluate() on the same data should return the same scores.
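
To make that expectation explicit, the integration test could assert it directly. A minimal sketch, reusing the recommender and eval_loader objects from the snippet above:

    import numpy as np

    # Hypothetical regression guard: two back-to-back evaluations of the same
    # loader should agree once the metric state is reset correctly.
    first = recommender.evaluate(eval_loader, return_dict=True)
    second = recommender.evaluate(eval_loader, return_dict=True)
    assert np.isclose(first["recall_at_100"], second["recall_at_100"]), (
        "recall_at_100 drifted between calls: "
        f"{first['recall_at_100']} vs {second['recall_at_100']}"
    )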

sararb · Oct 24 '22 16:10

Shall we create a bug ticket on the tf.keras repo? We need to create a repro example first.
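
A possible starting point for such a repro, purely a sketch and not tied to Merlin internals: check whether a plain Keras model with a stateful metric already gives different scores on back-to-back evaluate() calls over the same data.

    import numpy as np
    import tensorflow as tf

    # Hypothetical upstream repro skeleton: if the two scores below differ,
    # the problem is on the tf.keras side; if they match, the state leak is
    # likely specific to the Merlin top-k metrics.
    x = np.random.rand(256, 8).astype("float32")
    y = np.random.randint(0, 2, size=(256, 1)).astype("float32")

    model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
    model.compile(loss="binary_crossentropy", metrics=[tf.keras.metrics.Recall()])
    model.fit(x, y, epochs=1, batch_size=32, verbose=0)

    first = model.evaluate(x, y, return_dict=True, verbose=0)
    second = model.evaluate(x, y, return_dict=True, verbose=0)
    print(first["recall"], second["recall"])  # expected to match across calls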

rnyak · Oct 26 '22 16:10

fixed by #830

sararb · Oct 28 '22 12:10