[BUG] model.evaluate() gives much higher accuracy values for small batch sizes when `validation_data` is set in model.fit()
Bug description
When we train a two-tower model and then run model.evaluate(), we observe very different NDCG values depending on the batch size passed to model.evaluate(). If validation_data is removed from model.fit(), the NDCG values no longer vary noticeably with the batch size.
Steps/Code to reproduce bug
Please run the 05-Retrieval-Model notebook and add the following line after the model.fit() training step. Play with the batch_size; set it to 4096 as well to see the difference in the NDCG values.
metrics = model.evaluate(valid, batch_size=128, item_corpus=train, return_dict=True)
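For instance, running the evaluation twice with different batch sizes makes the gap visible (a minimal sketch, assuming the notebook's model, valid, and train objects are in scope):

metrics_small = model.evaluate(valid, batch_size=128, item_corpus=train, return_dict=True)
metrics_large = model.evaluate(valid, batch_size=4096, item_corpus=train, return_dict=True)
# The NDCG reported for batch_size=128 comes out much higher than for
# batch_size=4096, even though both calls evaluate the same model on the same data.
for name in metrics_small:
    print(name, metrics_small[name], metrics_large[name])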
Expected behavior
We should not see a large gap in NDCG values when changing the batch size in model.evaluate().
Environment details
- Merlin version:
- Platform:
- Python version:
- PyTorch version (GPU?):
- Tensorflow version (GPU?): Yes.
- Container: merlin-tensorflow-training:22.04 with the latest main branch pulled.
Root cause analysis
After some investigation, I found that the root cause of this issue is that model.evaluate() is called by model.fit() when the validation_data argument is passed, but it can also be called by the user explicitly.
For RetrievalModel, evaluation is done differently during model.fit() and model.evaluate() calls. During fit, we use the ItemRetrievalScorer to sample negative items for evaluation just as we do for training (so that we can compare train and validation metrics), but during evaluation we set pre_eval_topk with a TopKIndexBlock, which scores the top-k items and returns them as the candidates.
The issue happens in graph mode, when calling model.fit(validation_data=...) and then model.evaluate(item_corpus=...). The reason is that model.evaluate() is first called internally by model.fit(), at a point when pre_eval_topk is not set yet. Then, when model.evaluate(item_corpus=...) is explicitly called by the user, it uses the sampled negatives instead of the TopKIndexBlock. That is why the accuracy metrics in this issue vary with the eval batch size: with in-batch negative sampling, the larger the batch size, the more negative samples are provided and the lower the ranking metrics become.
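To see why the number of in-batch negatives dominates the metric, consider a toy simulation (a hedged sketch using plain NumPy, independent of Merlin; random scores stand in for model outputs):

import numpy as np

rng = np.random.default_rng(42)

def recall_at_k(num_queries, num_negatives, k=10):
    # For each query there is one positive item and (batch_size - 1) in-batch
    # negatives, all scored at random. Recall@k is the fraction of queries
    # whose positive lands in the top-k.
    hits = 0
    for _ in range(num_queries):
        scores = rng.random(num_negatives + 1)  # index 0 is the positive
        rank = (scores > scores[0]).sum()       # negatives that beat the positive
        hits += rank < k
    return hits / num_queries

print(recall_at_k(1000, 127))   # batch_size=128  -> ~0.078
print(recall_at_k(1000, 4095))  # batch_size=4096 -> ~0.002

With random scores, the expected recall@k over n+1 candidates is k / (n + 1), so going from 128 to 4096 in-batch candidates drops the metric by a factor of ~32 for reasons unrelated to model quality.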
Candidate solutions
Two potential alternatives for this issue would be:
1. Go back to requiring a call to model.set_retrieval_candidates_for_evaluation() before model.fit() for retrieval models (which would be very awkward for the user).
2. Add item_corpus as an argument to the fit() method of retrieval models, and set pre_eval_topk = TopKIndexBlock before calling super().fit() so that TopKIndexBlock is in the graph when needed. Then, when calling model.evaluate() for retrieval models, we wouldn't have to specify item_corpus again. A sketch of this alternative follows the list below.
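For illustration, alternative 2 could look roughly like this (a hypothetical sketch: set_retrieval_candidates_for_evaluation is the method mentioned in alternative 1, while the subclass and the fit() signature are assumptions, not the current API):

import merlin.models.tf as mm

class RetrievalModelWithCorpus(mm.RetrievalModel):
    # Hypothetical subclass sketching alternative 2.
    def fit(self, x, item_corpus=None, **kwargs):
        if item_corpus is not None:
            # Index the candidate items up front so the TopKIndexBlock is part
            # of the traced graph before Keras calls evaluate() internally.
            self.set_retrieval_candidates_for_evaluation(item_corpus)
        return super().fit(x, **kwargs)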
But after talking with @marcromeyn and @sararb about how best to fix this issue, we decided not to proceed with a fix right now, because the above alternatives would either burden the user with an unusual calling convention (1) or make the fit() method slower (2).
@marcromeyn is working on a comprehensive PR #368 that will allow decoupling metrics from the model. For evaluation over all items, the user could then explicitly set which candidate set they would like to use, as in this prototype snippet by Marc:
model.compile(metrics=[PreEvalTopK(10, "ndcg", "recall")])
So we are going to keep this issue blocked until #368 is merged.
Reproducibility
I have prepared a unit test that reproduces the issue Ronay reported in the notebook. The test passes when run_eagerly == True, but fails when it is False (graph mode): because model.evaluate() uses in-batch negative sampling rather than the TopKIndexBlock, increasing the batch size increases the number of negatives and makes the accuracy metrics worse.
import pytest
import tensorflow as tf

import merlin.models.tf as mm
from merlin.io import Dataset
from merlin.models.tf import MRRAt, RecallAt
from merlin.schema import Tags


@pytest.mark.parametrize("run_eagerly", [True, False])
def test_two_tower_retrieval_model_multiple_types_eval(ecommerce_data: Dataset, run_eagerly):
    ecommerce_data.schema = ecommerce_data.schema.remove_by_tag(Tags.TARGET)

    # Split the synthetic data into train and eval halves
    df = ecommerce_data.to_ddf().compute()
    train_ds = Dataset(df[: len(df) // 2], schema=ecommerce_data.schema)
    eval_ds = Dataset(df[len(df) // 2 :], schema=ecommerce_data.schema)

    metrics = [
        RecallAt(5),
        MRRAt(5),
    ]
    model = mm.TwoTowerModel(
        schema=ecommerce_data.schema,
        query_tower=mm.MLPBlock([128, 64]),
        samplers=[mm.InBatchSampler()],
        metrics=metrics,
        loss="categorical_crossentropy",
    )
    opt = tf.keras.optimizers.Adam(learning_rate=1e-3)
    model.compile(optimizer=opt, run_eagerly=run_eagerly)

    # Training
    num_epochs = 3
    losses = model.fit(
        train_ds,
        batch_size=64,
        epochs=num_epochs,
        train_metrics_steps=3,
        validation_data=eval_ds,
        validation_steps=3,
        candidate_items_for_eval=train_ds,
    )
    assert len(losses.epoch) == num_epochs

    # Evaluating over all items with two different batch sizes should yield
    # (nearly) identical metrics
    eval_metrics_1 = model.evaluate(
        eval_ds, eval_over_all_items=True, batch_size=10, return_dict=True
    )
    eval_metrics_2 = model.evaluate(
        eval_ds, eval_over_all_items=True, batch_size=50, return_dict=True
    )
    assert (
        len(set(eval_metrics_1.keys()).intersection(set(eval_metrics_2.keys())))
        == len(eval_metrics_1)
        == len(eval_metrics_2)
    )
    for key in ["recall_at_5", "mrr_at_5"]:
        tf.debugging.assert_near(eval_metrics_1[key], eval_metrics_2[key])
@marcromeyn Will your PR https://github.com/NVIDIA-Merlin/models/pull/368 fix this issue?
@marcromeyn, the PR is marked as closed. What is the next step?
@marcromeyn, please create a PR and link it.
I reproduced this error and it persists in the V1 retrieval models. It was one of the reasons for the creation of the V2 retrieval models: MatrixFactorizationModelV2 and TwoTowerModelV2.
So we won't fix this error in V1, which is going to be deprecated. To avoid silently reporting wrong evaluation metrics under the condition that triggers the error in graph mode (i.e. calling fit() with validation data and then evaluating over all items), I added an exception for that scenario in #892, which warns the user about the deprecation and recommends using the V2 retrieval models.
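The guard is along these lines (a hedged sketch of the idea only; the helper name and attribute check are assumptions, not the actual code from #892):

def _check_topk_evaluation(model, item_corpus):
    # Hypothetical helper illustrating the guard added in #892: refuse the
    # combination that silently produces wrong metrics instead of returning them.
    if item_corpus is not None and getattr(model, "pre_eval_topk", None) is None:
        raise ValueError(
            "Evaluating over all items after fit() with validation_data is not "
            "supported by V1 retrieval models in graph mode. V1 retrieval models "
            "are deprecated; please use the V2 retrieval models "
            "(MatrixFactorizationModelV2 and TwoTowerModelV2)."
        )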