
Reproducing CEDR-KNRM results on ANTIQUE

Open stepgazaille opened this issue 4 years ago • 7 comments

Hello, I'm trying to reproduce results from the OpenNIR paper using the Vanilla BERT and CEDR-KNRM models on the ANTIQUE dataset.

Taking my cues from the wsdm2020_demo.sh script, I trained my models as follows:

  1. First I fine-tuned and tested a Vanilla BERT model:
BERT_MODEL_PARAMS="trainer.grad_acc_batch=1 valid_pred.batch_size=4 test_pred.batch_size=4"
python -m onir.bin.pipeline config/antique config/vanilla_bert $BERT_MODEL_PARAMS 
python -m onir.bin.pipeline config/antique config/vanilla_bert $BERT_MODEL_PARAMS  pipeline.test=true

This produced the following results: test epoch=60 judged@10=0.6110 map_rel-3=0.2540 [mrr_rel-3=0.7288] p_rel-3@1=0.6450 p_rel-3@3=0.4917. However, the published results for Vanilla BERT are as follows:

  • MAP: 0.2801
  • MRR: 0.7101
  • P@1: 0.5950
  • P@3: 0.4967
  2. I then initialized a CEDR-KNRM model using weights from the fine-tuned Vanilla BERT model, and trained and tested it:
MODEL_PATH=[PATH_TO_FINE_TUNED_BERT]/60.p
BERT_MODEL_PARAMS="trainer.grad_acc_batch=1 valid_pred.batch_size=4 test_pred.batch_size=4"

python -m onir.bin.extract_bert_weights config/antique config/vanilla_bert $BERT_MODEL_PARAMS pipeline.bert_weights=$MODEL_PATH pipeline.overwrite=True
python -m onir.bin.pipeline config/antique config/cedr/knrm $BERT_MODEL_PARAMS vocab.bert_weights=$MODEL_PATH pipeline.overwrite=True
python -m onir.bin.pipeline config/antique config/cedr/knrm $BERT_MODEL_PARAMS vocab.bert_weights=$MODEL_PATH pipeline.test=true

This produced the following results: test epoch=30 judged@10=0.6030 map_rel-3=0.2563 [mrr_rel-3=0.7302] p_rel-3@1=0.6400 p_rel-3@3=0.5083. However, the published results for CEDR-KNRM are as follows:

  • MAP: 0.2861
  • MRR: 0.7238
  • P@1: 0.6300
  • P@3: 0.4933

According to the logs, I understand that the inference is deterministic ([trainer:pairwise][DEBUG] using GPU (deterministic)). Could anyone let me know what I am doing wrong? Where do the differences come from (especially w.r.t. MAP)?
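
For context on the metric names: the _rel-3 suffix means the graded ANTIQUE judgments (1 to 4) are binarized at grade 3 or higher before computing the metric. A minimal sketch of MRR under that convention, assuming a run as ranked doc IDs per query and qrels as graded judgments (the function name is illustrative):

def mrr_rel3(run, qrels):
    # run: {qid: [docid, ...]} in ranked order
    # qrels: {qid: {docid: grade}} with grades 1-4; grade >= 3 counts as relevant
    total = 0.0
    for qid, ranked in run.items():
        for rank, docid in enumerate(ranked, start=1):
            if qrels.get(qid, {}).get(docid, 0) >= 3:
                total += 1.0 / rank
                break
    return total / len(run)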

stepgazaille avatar Jun 11 '20 15:06 stepgazaille

Hi Stéphane,

Unfortunately, the deterministic indicator only corresponds to the torch.backends.cudnn.deterministic flag, which doesn't actually control for differences across specific GPUs or CUDA versions. Anecdotally, I've seen that different GPUs can yield different results, so I suspect these differences explain the performance discrepancies you're observing. Which GPU do you have? What version of CUDA are you using?
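
For illustration, here is roughly what full seeding plus that flag looks like in PyTorch. This is a minimal sketch (the helper name and seed value are arbitrary), and the caveat in the comments is the important part:

import random
import numpy as np
import torch

def set_deterministic(seed=42):
    # Fix RNG seeds for Python, NumPy, and PyTorch (CPU and all GPUs).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Ask cuDNN to select deterministic kernels and disable autotuning.
    # Caveat: this only constrains kernel selection on a given machine;
    # different GPU models or CUDA/cuDNN versions can still produce
    # slightly different floating-point results.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False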

  • sean

seanmacavaney avatar Jun 11 '20 16:06 seanmacavaney

Hello Sean,

Thank you for the quick answer! Here's my current setup:

  • OS: Ubuntu 19.10 Eoan Ermine
  • GPU: GeForce RTX 2080 SUPER
  • NVIDIA Driver Version: 440.33.01
  • CUDA Version: 10.2

Is that very different from your setup?

stepgazaille avatar Jun 11 '20 16:06 stepgazaille

Our setup for running the experiment was:

  • OS: Ubuntu 18.04
  • GPU: GeForce GTX 1080 Ti
  • NVIDIA Driver Version: 418.67
  • CUDA Version: 10.1

So there are differences there. To rule out other possibilities, do you get the same results as reported for BM25? The version of Anserini in the repository was updated since OpenNIR was originally released.

seanmacavaney avatar Jun 11 '20 16:06 seanmacavaney

Executing the following commands:

scripts/pipeline.sh config/grid_search config/antique
scripts/pipeline.sh config/grid_search config/antique pipeline.test=True

I obtain the following results: test bm25_k1-1.4_b-0.40 judged@10=0.5960 map_rel-3=0.1945 [mrr_rel-3=0.5793] p_rel-3@1=0.4550 p_rel-3@3=0.3650.

The published results for BM25 are as follows:

  • MAP: 0.1888
  • MRR: 0.5464
  • P@1: 0.4450
  • P@3: 0.3467

So there are a couple of differences here too. Do you remember which commit you were at when you ran the tests that led to the reported results? I could try re-executing the commands above using that version of OpenNIR.
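
As an aside, the run name bm25_k1-1.4_b-0.40 above encodes the grid-searched BM25 parameters k1=1.4 and b=0.40. For context, a minimal sketch of the per-term BM25 contribution those parameters tune (using a Lucene-style IDF; Anserini's exact formulation may differ slightly):

import math

def bm25_term(tf, df, doc_len, avg_doc_len, num_docs, k1=1.4, b=0.40):
    # Inverse document frequency of the term.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # k1 controls term-frequency saturation; b controls length normalization.
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm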

stepgazaille avatar Jun 11 '20 17:06 stepgazaille

It should be the initial commit: ca14dfa5e7...

Note that you'll need to clear the ~/data/onir directory (or rename it), otherwise it will use the indices built from the newer version.
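
For example, a minimal way to set the old directory aside with Python (the backup name is arbitrary):

from pathlib import Path

data_dir = Path.home() / "data" / "onir"
if data_dir.exists():
    # Rename rather than delete, so the newer indices can be restored later.
    data_dir.rename(data_dir.with_name("onir.bak"))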

seanmacavaney avatar Jun 11 '20 18:06 seanmacavaney

Hello Sean,

Today I cleaned up my ~/data/onir directory, checked out the initial commit, and re-ran the experiments.

The BM25 baseline produced the following results: test bm25_k1-1.4_b-0.40 judged@10=0.5960 map_rel-3=0.1945 [mrr_rel-3=0.5797] p_rel-3@1=0.4550 p_rel-3@3=0.3667. So here P@1 and P@3 do match the reported results (0.4450 and 0.3467 respectively); however, I'm surprised to find that MAP and MRR do not.

Fine-tuning BERT produced the following results: test epoch=22 judged@10=0.6050 map_rel-3=0.2536 [mrr_rel-3=0.7125] p_rel-3@1=0.6200 p_rel-3@3=0.5033, which do not match the reported results.

Training the CEDR-KNRM model (initialised using the newly fine-tuned BERT weights) produced the following results: test epoch=14 judged@10=0.6105 map_rel-3=0.2537 [mrr_rel-3=0.7105] p_rel-3@1=0.6100 p_rel-3@3=0.5017, which do not match the reported results either.

Here I'm surprised to find that CEDR-KNRM's performance is lower than the fine-tuned BERT's. I used all the same commands as in my previous comments. Please let me know if you have any other leads I might try.

On another subject, is there any way to produce a human-readable version of the models' output? I'd like to do an ad-hoc evaluation of the models I've trained so far (compare the predictions to the gold standard, etc.).

Thank you for all your help!

stepgazaille avatar Jun 12 '20 18:06 stepgazaille

Hmmm, fascinating! Thanks for running these tests. The BM25 discrepancies are puzzling, as are the performance differences between Vanilla BERT and CEDR-KNRM. I'm out of ideas about what could cause these differences.

The pipeline saves run files under ~/data/onir/models/.../runs/[epoch].run (the exact path appears in the pipeline output) in the standard TREC run format. You can find the queries in ~/data/onir/datasets/antique/[subset].queries.txt, and the document content (which is indexed) can be found here. If you'd like to run the system over arbitrary queries/documents, you can use the flex dataset.
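
For example, here's a minimal sketch for turning a run file into something human-readable, assuming the standard six-column TREC format (qid Q0 docid rank score runname) and tab-separated qid/text query files; the paths below are illustrative placeholders:

import os

def load_run(path):
    # Parse a TREC run file into {qid: [(rank, docid, score), ...]}.
    run = {}
    with open(os.path.expanduser(path)) as f:
        for line in f:
            qid, _, docid, rank, score, _ = line.split()
            run.setdefault(qid, []).append((int(rank), docid, float(score)))
    for ranked in run.values():
        ranked.sort()  # order by rank
    return run

def load_queries(path):
    # Parse a queries file, assuming tab-separated qid/text pairs.
    with open(os.path.expanduser(path)) as f:
        return dict(line.rstrip("\n").split("\t", 1) for line in f)

run = load_run("~/data/onir/models/.../runs/60.run")  # substitute your model's run path
queries = load_queries("~/data/onir/datasets/antique/test.queries.txt")
for qid, ranked in list(run.items())[:3]:
    print(qid, queries.get(qid, "?"))
    for rank, docid, score in ranked[:5]:
        print(f"  {rank:>3} {docid} {score:.4f}")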

seanmacavaney avatar Jun 12 '20 19:06 seanmacavaney