
Dataset Builder creates duplicate query-document pairs & model predictions are odd

Open · littlewine opened this issue 4 years ago · 1 comment

I have the following issue, which is really odd and affects the evaluation of the neural models. I build my data using the auto preparer, and I came to realize that when I make predictions on the test set, some query-document pairs are duplicated. I am not sure why this is happening; my first guess was that extra examples are added to fill up the batch size, but that does not seem to be the case.

Here's most of my code:

    model, prpr, dsb, dlb = preparer.prepare(model_class, train_pack)

    train_prepr = prpr.transform(train_pack)
    valid_prepr = prpr.transform(valid_pack)
    test_prepr = prpr.transform(test_pack)

    # NB: this line builds a DatasetBuilder but discards it; the dsb returned
    # by preparer.prepare() is what is actually used below.
    mz.dataloader.dataset_builder.DatasetBuilder()
    train_dataset = dsb.build(train_prepr)
    valid_dataset = dsb.build(valid_prepr)
    test_dataset = dsb.build(test_prepr)

    train_dl = dlb.build(train_dataset, stage='train')
    valid_dl = dlb.build(valid_dataset, stage='dev')
    test_dl = dlb.build(test_dataset, stage='test')

    # training the model etc....

    test_preds = pd.DataFrame(trainer.predict(test_dl), columns=['pred'])
    test_preds['id_left'] = test_dl.id_left
    test_preds['id_right'] = test_dl._dataset[:][0]['id_right']
    test_preds['length_right'] = test_dl._dataset[:][0]['length_right']
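One sanity check worth adding right after this step is that the score vector and the id columns line up one-to-one and that no query-document pair repeats. A minimal sketch with toy data — the variable names mirror the snippet above, but the values here are made up for illustration:

```python
import pandas as pd

# Toy stand-ins: in the snippet above these would come from
# trainer.predict(test_dl) and test_dl._dataset[:][0].
preds = [-10.9, -9.5, -6.9]
id_left = ['33-1-1', '33-1-1', '33-1-1']
id_right = ['47-07395', '98-33779', '95-23333']

# Lengths must agree, otherwise scores get attached to the wrong rows.
assert len(preds) == len(id_left) == len(id_right)

test_preds = pd.DataFrame({'pred': preds,
                           'id_left': id_left,
                           'id_right': id_right})

# Each query-document pair should be scored exactly once.
n_dup = test_preds.duplicated(['id_left', 'id_right']).sum()
print('duplicated pairs:', n_dup)
```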

Now, it seems that the duplicates are created through the dataset builder, but I don't understand why.

    test_dataset._data_pack.frame().duplicated(['id_left', 'id_right']).sum()
>> 297
    test_pack.frame().duplicated(['id_left', 'id_right']).sum()
>> 0
    test_prepr.frame().duplicated(['id_left', 'id_right']).sum()
>> 0

Even more odd is the fact that those duplicates receive different scores for the same query-document pair, and the scores are not even close to each other, so this cannot be some rounding error. How is it possible that, without re-training the model, I get such different predictions for the same query-document pairs at inference time?


    print(test_preds[test_preds.duplicated(['id_right', 'id_left'], keep=False)]
          .sort_values(['id_left', 'id_right']))

>>
            pred id_left  id_right  length_right
466   -10.889746  33-1-1  47-07395           896
499    -9.492123  33-1-1  47-07395           896
677    -6.880966  33-1-1  47-07395           896
496   -10.781660  33-1-1  98-33779           535
678    -7.954109  33-1-1  98-33779           535
1044  -11.102488  33-1-1  98-33779           535
508    -6.497414  33-1-1  95-23333           244
1326   -7.466503  33-1-1  95-23333           244
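Until the sampling itself is fixed, one workaround is to collapse the duplicates after prediction by aggregating the scores per (id_left, id_right) pair, e.g. averaging them. A sketch on a toy frame mirroring the output above (this hides the symptom rather than fixing its cause):

```python
import pandas as pd

# Toy frame mirroring some of the duplicated rows shown above.
test_preds = pd.DataFrame({
    'pred': [-10.889746, -9.492123, -6.880966, -6.497414, -7.466503],
    'id_left': ['33-1-1'] * 5,
    'id_right': ['47-07395', '47-07395', '47-07395', '95-23333', '95-23333'],
})

# Average the score of each duplicated query-document pair.
deduped = (test_preds
           .groupby(['id_left', 'id_right'], as_index=False)['pred']
           .mean())

assert not deduped.duplicated(['id_left', 'id_right']).any()
print(deduped)
```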

In this reproduced example the model was KNRM, but I think this happens with other models too.

littlewine avatar Apr 20 '20 17:04 littlewine

Hi @littlewine, there are indeed three ways to organize a datapack, i.e., point-wise, pair-wise, and list-wise. For training, we can choose either one according to the loss function. In testing, however, we should not organize the datapack pair-wise, since that will add duplicate instances to fill the batch size.

faneshion avatar Sep 20 '20 07:09 faneshion
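To illustrate the point above, here is a minimal pure-Python sketch of why pair-wise organization emits the same (query, document) instance more than once while point-wise does not. This is not MatchZoo's actual sampler; `num_dup` is borrowed from its dataset API purely as an illustrative parameter:

```python
# Point-wise data: one (query, doc, label) instance per pair.
instances = [
    ('q1', 'd1', 1), ('q1', 'd2', 0), ('q1', 'd3', 0),
]

def pointwise(data):
    # Each query-document pair appears exactly once.
    return [(q, d) for q, d, _ in data]

def pairwise(data, num_dup=2):
    # Simplified resampling: each positive is duplicated num_dup times,
    # and each copy is paired with every negative for the same query.
    out = []
    for q, d, label in data:
        if label == 1:
            for _ in range(num_dup):
                for q2, d2, l2 in data:
                    if q2 == q and l2 == 0:
                        out.append((q, d))    # positive, repeated
                        out.append((q2, d2))  # negative, repeated
    return out

def n_dups(pairs):
    return len(pairs) - len(set(pairs))

point = pointwise(instances)
pair = pairwise(instances)
print(n_dups(point), n_dups(pair))  # point-wise: 0 duplicates; pair-wise: several
```

For a training loss over pairs the repetition is harmless, but at test time every repeated instance gets scored again, which is why the same query-document pair shows up multiple times (with potentially different scores if inputs are resampled differently) in the predictions above.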