
[BUG] The evaluation metrics of the YouTubeDNN model are worse after the refactoring of model.compile()

Open gabrielspmoreira opened this issue 3 years ago • 2 comments

Bug description

After the refactoring that moved losses and metrics to model.compile(), the loss and evaluation metrics are worse for the YouTubeDNN retrieval model. For the LastFM dataset, for example (using the retrieval experiments script), Recall@100-final dropped from 0.08148 to 0.01429.

The results can be reproduced using the retrieval experiments script, with the following arguments and LastFM dataset:

python scripts/retrieval_train_eval.py \
  --dataset lastfmB --wandb_exp_group lastfmB_youtubednn_xe_sampledsofmax_v07.2 \
  --model_type youtubednn --two_tower_activation selu --epochs 20 --lr_decay_steps 50 \
  --output_path /results --data_path /data --log_to_wandb --optimizer adam \
  --eval_batch_size 2048 --train_metrics_steps 100 --topk_metrics_cutoffs 100 \
  --max_seq_length 20 --youtubednn_sampled_softmax True \
  --fail_if_recall_at_100_higher_than 0.5 --xe_label_smoothing 0.0 \
  --two_tower_mlp_layers None --two_tower_dropout 0.3 \
  --two_tower_embedding_sizes_multiplier 5.0 --logits_temperature 0.8 \
  --lr 0.02238982864512884 --lr_decay_rate 0.9400000000000001 \
  --l2_reg 1.1472035643715902e-05 --embeddings_l2_reg 1.01774316277423e-06 \
  --train_batch_size 4096 --item_id_emb_size 512 \
  --youtubednn_sampled_softmax_n_candidates 500

gabrielspmoreira avatar Jun 06 '22 15:06 gabrielspmoreira

I did an investigation to isolate the issue and discovered that, for the YouTubeDNN model, negative sampling is being done during evaluation here with training=False and testing=True. Before the refactoring we had training=False and eval_sampling=False in this case, which skipped negative sampling during evaluation so that metrics were computed over all items (as we do for the other retrieval models). The following logic then breaks, as num_classes cannot be inferred from the 2nd dim of the sampled predictions:

num_classes = tf.shape(predictions)[-1]
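A minimal sketch (using NumPy for clarity; the shapes and the catalog/candidate sizes are illustrative) of why inferring num_classes from the last dimension of the predictions fails once negatives are sampled during evaluation: the inferred value becomes the sample size rather than the catalog size, and top-k indices are positions within the sample rather than catalog item ids.

```python
import numpy as np

rng = np.random.default_rng(0)
catalog_size = 10_000   # assumed full item catalog size
n_candidates = 500      # assumed sampled-softmax candidate count

# Full-catalog evaluation: one score per catalog item.
full_preds = rng.random((2, catalog_size))
num_classes_full = full_preds.shape[-1]      # 10_000 -> matches the catalog

# Sampled evaluation: scores only for the positive + sampled negatives.
sampled_preds = rng.random((2, n_candidates + 1))
num_classes_sampled = sampled_preds.shape[-1]  # 501 -> NOT the catalog size

# Top-k indices from sampled scores are positions within the sample,
# not catalog item ids, so recall@100 against catalog labels is meaningless.
topk_local = np.argsort(-sampled_preds, axis=-1)[:, :100]
```

This is why evaluation over sampled negatives produces much lower (and incomparable) top-k metrics than evaluation over the full catalog.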

To fix this, I see some options:

1. Reintroduce the Callback that sets a bool variable telling model.test_step() whether evaluation is being performed during training (in which case it should sample negatives for metrics computation) or as standalone model evaluation (in which case it should not sample negatives, but rather evaluate over all items).
2. Test (and potentially adapt) @karlhigley's PR #473, which turns YouTubeDNN into a RetrievalModel, since RetrievalModel already has the desired behaviour: sampling for evaluation during fit(), and evaluating over all items during model.evaluate().
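A rough sketch of option 1, assuming a hypothetical `sample_eval_negatives` flag that test_step() would read (the callback class name and flag name are illustrative, not existing Merlin Models API). Because the callback would be attached only to fit(), its on_test_begin/on_test_end hooks fire only for in-training evaluation; a standalone model.evaluate() call never sets the flag:

```python
import tensorflow as tf

class EvalNegativeSamplingToggle(tf.keras.callbacks.Callback):
    """Hypothetical callback: pass it to fit() only, so sampled
    evaluation is enabled during training-time eval epochs and
    disabled everywhere else (e.g. standalone model.evaluate())."""

    def on_test_begin(self, logs=None):
        # Flag (assumed) that test_step() would check before sampling negatives.
        self.model.sample_eval_negatives = True

    def on_test_end(self, logs=None):
        self.model.sample_eval_negatives = False
```

Usage would be along the lines of `model.fit(..., callbacks=[EvalNegativeSamplingToggle()])`, leaving `model.evaluate(...)` to default to full-catalog evaluation.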

Thoughts @marcromeyn ?

gabrielspmoreira avatar Jun 07 '22 00:06 gabrielspmoreira

The research script should be updated and the issue should be verified again. A unit test for RetrievalModel should also be added.

rnyak avatar Feb 06 '23 16:02 rnyak