MatchZoo-py
Dataset Builder creates duplicate query-document pairs & model predictions are odd
I have the following issue, which is really odd and affects the evaluation of the neural models. I build my data using the auto preparer, and I came to realize that when I try to make predictions on the test set, some query-document pairs are duplicated. I am not sure why this is happening; my first guess was that it fills up the missing examples until the batch size is reached, but this does not seem to be the case.
Here's most of my code:
model, prpr, dsb, dlb = preparer.prepare(model_class, train_pack)
train_prepr = prpr.transform(train_pack)
valid_prepr = prpr.transform(valid_pack)
test_prepr = prpr.transform(test_pack)
mz.dataloader.dataset_builder.DatasetBuilder()
train_dataset = dsb.build(train_prepr)
valid_dataset = dsb.build(valid_prepr)
test_dataset = dsb.build(test_prepr)
train_dl = dlb.build(train_dataset, stage='train')
valid_dl = dlb.build(valid_dataset, stage='dev')
test_dl = dlb.build(test_dataset, stage='test')
# training the model etc....
test_preds = pd.DataFrame(trainer.predict(test_dl), columns=['pred'])
test_preds['id_left'] = test_dl.id_left
test_preds['id_right'] = test_dl._dataset[:][0]['id_right']
test_preds['length_right'] = test_dl._dataset[:][0]['length_right']
Now, it seems that the duplicates are created through the dataset builder, but I don't understand why.
test_dataset._data_pack.frame().duplicated(['id_left', 'id_right']).sum()
>> 297
test_pack.frame().duplicated(['id_left', 'id_right']).sum()
>> 0
test_prepr.frame().duplicated(['id_left', 'id_right']).sum()
>> 0
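The pattern is consistent with batch padding: if a builder resamples existing pairs to fill the last (or every) batch, the flattened frame will contain duplicate (id_left, id_right) rows even though the original pack has none. A minimal pandas sketch of that mechanism (the frame and batch size here are made up for illustration):

```python
import pandas as pd

# Toy frame with no duplicate query-document pairs.
pairs = pd.DataFrame({
    'id_left':  ['q1', 'q1', 'q2'],
    'id_right': ['d1', 'd2', 'd3'],
})
batch_size = 4

# Pad the batch by re-drawing existing rows until it is full --
# this is the hypothesized source of the duplicates.
padded = pd.concat(
    [pairs, pairs.sample(batch_size - len(pairs), random_state=0)],
    ignore_index=True,
)
print(padded.duplicated(['id_left', 'id_right']).sum())  # > 0 after padding
```

The same `duplicated(['id_left', 'id_right']).sum()` check used above then reports a non-zero count, exactly as seen with `test_dataset._data_pack`.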
Even odder is the fact that those duplicates get different scores for the same query-document pair, and the scores are not even close to each other, so this cannot be a rounding error. How is it possible that, without re-training the model, I get such different predictions for the same query-document pairs at inference time?
print(test_preds[test_preds.duplicated(['id_right', 'id_left'],
keep=False)].sort_values(['id_left', 'id_right'])
)
>>
pred id_left id_right length_right
466 -10.889746 33-1-1 47-07395 896
499 -9.492123 33-1-1 47-07395 896
677 -6.880966 33-1-1 47-07395 896
496 -10.781660 33-1-1 98-33779 535
678 -7.954109 33-1-1 98-33779 535
1044 -11.102488 33-1-1 98-33779 535
508 -6.497414 33-1-1 95-23333 244
1326 -7.466503 33-1-1 95-23333 244
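As a stopgap (not the proper fix), duplicated pairs in a prediction frame like the one above can be collapsed by aggregating their scores, e.g. with a groupby mean; the column names follow the frame in the question, and the values are a small made-up subset:

```python
import pandas as pd

# Small made-up subset of a predictions frame with duplicated pairs.
test_preds = pd.DataFrame({
    'pred':     [-10.889746, -9.492123, -6.880966, -6.497414],
    'id_left':  ['33-1-1'] * 4,
    'id_right': ['47-07395', '47-07395', '47-07395', '95-23333'],
})

# Collapse duplicates by averaging the scores per (query, document) pair.
deduped = (test_preds
           .groupby(['id_left', 'id_right'], as_index=False)['pred']
           .mean())
print(len(deduped))  # one row per unique pair
```

This only hides the symptom; since the duplicates score differently, the underlying dataset mode should be fixed instead.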
In this replicated example the model used was KNRM, but I think this happens with other models too.
Hi @littlewine, there are indeed three kinds of datapack: point-wise, pair-wise, and list-wise. For training, we can choose any of them according to the loss function. For testing, however, we should not organize the datapack pair-wise, since that adds duplicate instances to fill the batch size.