Errors with new MS MARCO v2.1 and BEIR regressions
Running:
java -cp `ls target/*-fatjar.jar` io.anserini.reproduce.RunMsMarco -v 2
Getting some errors:
# Running condition "bm25-segmented": BM25 v2.1 Segmented Corpus (k1=0.9, b=0.4)
- topic_key: msmarco-v2-doc-dev
Running retrieval command: java -cp /Users/jimmylin/workspace/anserini/target/anserini-0.36.1-SNAPSHOT-fatjar.jar io.anserini.search.SearchCollection -threads 16 -index msmarco-v2.1-doc-segmented -topics msmarco-v2-doc-dev -output runs/run.msmarco-v2.1-doc.bm25-segmented.msmarco-v2-doc-dev.txt -hits 1000 -bm25 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000
Run successfully completed!
MRR@10: 0.0000 [FAIL] expected 0.1973
- topic_key: msmarco-v2-doc-dev2
Running retrieval command: java -cp /Users/jimmylin/workspace/anserini/target/anserini-0.36.1-SNAPSHOT-fatjar.jar io.anserini.search.SearchCollection -threads 16 -index msmarco-v2.1-doc-segmented -topics msmarco-v2-doc-dev2 -output runs/run.msmarco-v2.1-doc.bm25-segmented.msmarco-v2-doc-dev2.txt -hits 1000 -bm25 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000
Run successfully completed!
MRR@10: 0.0000 [FAIL] expected 0.2000
- topic_key: trec2021-dl
Running retrieval command: java -cp /Users/jimmylin/workspace/anserini/target/anserini-0.36.1-SNAPSHOT-fatjar.jar io.anserini.search.SearchCollection -threads 16 -index msmarco-v2.1-doc-segmented -topics trec2021-dl -output runs/run.msmarco-v2.1-doc.bm25-segmented.trec2021-dl.txt -hits 1000 -bm25 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000
Run successfully completed!
MAP: 0.0000 [FAIL] expected 0.2609
MRR@10: 0.0000 [FAIL] expected 0.9026
nDCG@10: 0.0000 [FAIL] expected 0.5778
R@100: 0.0000 [FAIL] expected 0.3811
R@1K: 0.0000 [FAIL] expected 0.7115
- topic_key: trec2022-dl
Running retrieval command: java -cp /Users/jimmylin/workspace/anserini/target/anserini-0.36.1-SNAPSHOT-fatjar.jar io.anserini.search.SearchCollection -threads 16 -index msmarco-v2.1-doc-segmented -topics trec2022-dl -output runs/run.msmarco-v2.1-doc.bm25-segmented.trec2022-dl.txt -hits 1000 -bm25 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000
Run successfully completed!
MAP: 0.0000 [FAIL] expected 0.1079
MRR@10: 0.0000 [FAIL] expected 0.7213
nDCG@10: 0.0000 [FAIL] expected 0.3576
R@100: 0.0000 [FAIL] expected 0.2330
R@1K: 0.0000 [FAIL] expected 0.4790
- topic_key: trec2023-dl
Running retrieval command: java -cp /Users/jimmylin/workspace/anserini/target/anserini-0.36.1-SNAPSHOT-fatjar.jar io.anserini.search.SearchCollection -threads 16 -index msmarco-v2.1-doc-segmented -topics trec2023-dl -output runs/run.msmarco-v2.1-doc.bm25-segmented.trec2023-dl.txt -hits 1000 -bm25 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000
Run successfully completed!
MAP: 0.0000 [FAIL] expected 0.1391
MRR@10: 0.0000 [FAIL] expected 0.6519
nDCG@10: 0.0000 [FAIL] expected 0.3356
R@100: 0.0000 [FAIL] expected 0.3049
R@1K: 0.0000 [FAIL] expected 0.5852
@wu-ming233 can you please take a look?
I cannot seem to reproduce the issue locally at the moment... I will try again after clearing the cache.
Similarly, getting:
# Running condition "Dp": bge-base-en-v1.5 cached queries
- topic_key: trec-covid
Running retrieval command: java -cp /Users/jimmylin/workspace/anserini/target/anserini-0.36.1-SNAPSHOT-fatjar.jar io.anserini.search.SearchCollection -threads 16 -index beir-v1.0.0-trec-covid.bge-base-en-v1.5 -topics beir-trec-covid.bge-base-en-v1.5 -output runs/run.beir.Dp.trec-covid.txt -threads 16 -efSearch 1000 -removeQuery
Run successfully completed!
Evaluation command failed for metric: nDCG@10
- topic_key: bioasq
Running retrieval command: java -cp /Users/jimmylin/workspace/anserini/target/anserini-0.36.1-SNAPSHOT-fatjar.jar io.anserini.search.SearchCollection -threads 16 -index beir-v1.0.0-bioasq.bge-base-en-v1.5 -topics beir-bioasq.bge-base-en-v1.5 -output runs/run.beir.Dp.bioasq.txt -threads 16 -efSearch 1000 -removeQuery
Run successfully completed!
Evaluation command failed for metric: nDCG@10
...
More debugging trace:
# Running condition "Dp": bge-base-en-v1.5 cached queries
- topic_key: trec-covid
Running retrieval command: java -cp /Users/jimmylin/workspace/anserini/target/anserini-0.36.1-SNAPSHOT-fatjar.jar io.anserini.search.SearchCollection -threads 16 -index beir-v1.0.0-trec-covid.bge-base-en-v1.5 -topics beir-trec-covid.bge-base-en-v1.5 -output runs/run.beir.Dp.trec-covid.txt -threads 16 -efSearch 1000 -removeQuery
Run successfully completed!
Running evaluation command: java -cp /Users/jimmylin/workspace/anserini/target/anserini-0.36.1-SNAPSHOT-fatjar.jar trec_eval -c -m ndcg_cut.10 beir-v1.0.0-trec-covid.test runs/run.beir.Dp.trec-covid.txt
Evaluation command failed for metric: nDCG@10
The issue is here:
% java -cp /Users/jimmylin/workspace/anserini/target/anserini-0.36.1-SNAPSHOT-fatjar.jar io.anserini.search.SearchCollection -threads 16 -index beir-v1.0.0-trec-covid.bge-base-en-v1.5 -topics beir-trec-covid.bge-base-en-v1.5 -output runs/run.beir.Dp.trec-covid.txt -threads 16 -efSearch 1000 -removeQuery
Error: "-efSearch" is not a valid option. For help, use "-options" to print out information about options.
@wu-ming233 can you please fix?
Okay, this is weird. Adding debugging information and commenting out parts of the yaml:
# Running condition "bm25": BM25 v2.1 (k1=0.9, b=0.4)
- topic_key: trec2021-dl
Running retrieval command: java -cp /Users/jimmylin/workspace/anserini/target/anserini-0.36.1-SNAPSHOT-fatjar.jar io.anserini.search.SearchCollection -threads 16 -index msmarco-v2.1-doc -topics trec2021-dl -output runs/run.msmarco-v2.1-doc.bm25.trec2021-dl.txt -hits 1000 -bm25
Run successfully completed!
Running evaluation command: java -cp /Users/jimmylin/workspace/anserini/target/anserini-0.36.1-SNAPSHOT-fatjar.jar trec_eval -c -M 100 -m map dl21-doc-msmarco-v2.1 runs/run.msmarco-v2.1-doc.bm25.trec2021-dl.txt
MAP: 0.2281 [OK]
Running evaluation command: java -cp /Users/jimmylin/workspace/anserini/target/anserini-0.36.1-SNAPSHOT-fatjar.jar trec_eval -c -M 100 -m recip_rank dl21-doc-msmarco-v2.1 runs/run.msmarco-v2.1-doc.bm25.trec2021-dl.txt
MRR@10: 0.8466 [OK]
Running evaluation command: java -cp /Users/jimmylin/workspace/anserini/target/anserini-0.36.1-SNAPSHOT-fatjar.jar trec_eval -c -m ndcg_cut.10 dl21-doc-msmarco-v2.1 runs/run.msmarco-v2.1-doc.bm25.trec2021-dl.txt
nDCG@10: 0.5183 [OK]
Running evaluation command: java -cp /Users/jimmylin/workspace/anserini/target/anserini-0.36.1-SNAPSHOT-fatjar.jar trec_eval -c -m recall.100 dl21-doc-msmarco-v2.1 runs/run.msmarco-v2.1-doc.bm25.trec2021-dl.txt
R@100: 0.3502 [OK]
Running evaluation command: java -cp /Users/jimmylin/workspace/anserini/target/anserini-0.36.1-SNAPSHOT-fatjar.jar trec_eval -c -m recall.1000 dl21-doc-msmarco-v2.1 runs/run.msmarco-v2.1-doc.bm25.trec2021-dl.txt
R@1K: 0.6915 [OK]
# Running condition "bm25-segmented": BM25 v2.1 Segmented Corpus (k1=0.9, b=0.4)
- topic_key: trec2021-dl
Running retrieval command: java -cp /Users/jimmylin/workspace/anserini/target/anserini-0.36.1-SNAPSHOT-fatjar.jar io.anserini.search.SearchCollection -threads 16 -index msmarco-v2.1-doc-segmented -topics trec2021-dl -output runs/run.msmarco-v2.1-doc.bm25-segmented.trec2021-dl.txt -hits 1000 -bm25 -selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 1000
Run successfully completed!
Running evaluation command: java -cp /Users/jimmylin/workspace/anserini/target/anserini-0.36.1-SNAPSHOT-fatjar.jar trec_eval -c -M 100 -m map dl21-doc-msmarco-v2.1 runs/run.msmarco-v2.1-doc.bm25-segmented.trec2021-dl.txt
MAP: 0.0000 [FAIL] expected 0.2609
Running evaluation command: java -cp /Users/jimmylin/workspace/anserini/target/anserini-0.36.1-SNAPSHOT-fatjar.jar trec_eval -c -M 100 -m recip_rank dl21-doc-msmarco-v2.1 runs/run.msmarco-v2.1-doc.bm25-segmented.trec2021-dl.txt
MRR@10: 0.0000 [FAIL] expected 0.9026
Running evaluation command: java -cp /Users/jimmylin/workspace/anserini/target/anserini-0.36.1-SNAPSHOT-fatjar.jar trec_eval -c -m ndcg_cut.10 dl21-doc-msmarco-v2.1 runs/run.msmarco-v2.1-doc.bm25-segmented.trec2021-dl.txt
nDCG@10: 0.0000 [FAIL] expected 0.5778
Running evaluation command: java -cp /Users/jimmylin/workspace/anserini/target/anserini-0.36.1-SNAPSHOT-fatjar.jar trec_eval -c -m recall.100 dl21-doc-msmarco-v2.1 runs/run.msmarco-v2.1-doc.bm25-segmented.trec2021-dl.txt
R@100: 0.0000 [FAIL] expected 0.3811
Running evaluation command: java -cp /Users/jimmylin/workspace/anserini/target/anserini-0.36.1-SNAPSHOT-fatjar.jar trec_eval -c -m recall.1000 dl21-doc-msmarco-v2.1 runs/run.msmarco-v2.1-doc.bm25-segmented.trec2021-dl.txt
R@1K: 0.0000 [FAIL] expected 0.7115
But when I copy/paste the commands and run them separately, they seem to work fine... 🤷‍♂️
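All five metrics coming back as exactly 0.0000 is consistent with trec_eval finding no judged documents in the run at all: either the run file was empty or incomplete when it was evaluated, or its docids do not match the qrels (for the segmented index, for example, if the part after the "#" delimiter were left on the docid). A minimal sketch of a check that tells these cases apart is below. `RunQrelsOverlap` is a hypothetical helper, not part of Anserini, and it assumes a local copy of the qrels file, since `dl21-doc-msmarco-v2.1` in the log is a symbolic name resolved by the toolkit.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical diagnostic (not part of Anserini): compares docids in a TREC run file with a qrels file. */
final class RunQrelsOverlap {

  /** Collects "qid docid" pairs from whitespace-separated lines, reading the docid from column docidCol. */
  static Set<String> pairs(List<String> lines, int docidCol) {
    Set<String> out = new HashSet<>();
    for (String line : lines) {
      String[] fields = line.trim().split("\\s+");
      if (fields.length > docidCol) {  // skips blank or truncated lines
        out.add(fields[0] + " " + fields[docidCol]);
      }
    }
    return out;
  }

  /** Number of (qid, docid) pairs in the run that also appear in the qrels. */
  static int overlap(List<String> runLines, List<String> qrelsLines) {
    // Run format:   qid Q0 docid rank score tag
    // Qrels format: qid 0 docid rel
    // The docid is the third column (index 2) in both.
    Set<String> shared = pairs(runLines, 2);
    shared.retainAll(pairs(qrelsLines, 2));
    return shared.size();
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("usage: RunQrelsOverlap <run-file> <local-qrels-file>");
      return;
    }
    List<String> run = Files.readAllLines(Path.of(args[0]));
    List<String> qrels = Files.readAllLines(Path.of(args[1]));
    System.out.println("run lines: " + run.size());
    System.out.println("judged (qid, docid) pairs retrieved: " + overlap(run, qrels));
    if (!run.isEmpty()) {
      System.out.println("first run line: " + run.get(0));
    }
  }
}
```

If the run has lines but the overlap is zero, the docid format is the likely suspect; if the run has no lines, the retrieval step itself needs a closer look.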
Fixed the typo that caused this evaluation command to fail for bge-base-en-v1.5 cached queries:
% java -cp /Users/jimmylin/workspace/anserini/target/anserini-0.36.1-SNAPSHOT-fatjar.jar io.anserini.search.SearchCollection -threads 16 -index beir-v1.0.0-trec-covid.bge-base-en-v1.5 -topics beir-trec-covid.bge-base-en-v1.5 -output runs/run.beir.Dp.trec-covid.txt -threads 16 -efSearch 1000 -removeQuery
Error: "-efSearch" is not a valid option. For help, use "-options" to print out information about options.
Still looking into the issue where the evaluation commands return a metric of 0 and fail the checks. I still cannot reproduce the issue consistently; I currently suspect it has something to do with the user downloading the indexes. I will keep investigating.
Sorry that I am taking some time with this fix :( my local machine takes a very long time to run the regressions. If this is urgent, I will look for more powerful compute.
> Fixed the typo that caused this evaluation command to fail for bge-base-en-v1.5 cached queries:
> % java -cp /Users/jimmylin/workspace/anserini/target/anserini-0.36.1-SNAPSHOT-fatjar.jar io.anserini.search.SearchCollection -threads 16 -index beir-v1.0.0-trec-covid.bge-base-en-v1.5 -topics beir-trec-covid.bge-base-en-v1.5 -output runs/run.beir.Dp.trec-covid.txt -threads 16 -efSearch 1000 -removeQuery
> Error: "-efSearch" is not a valid option. For help, use "-options" to print out information about options.
Thanks!
> Still looking into the issue where the evaluation commands return a metric of 0 and fail the checks. I still cannot reproduce the issue consistently; I currently suspect it has something to do with the user downloading the indexes. I will keep investigating.
I don't think it's downloading... perhaps some type of process management issue from Java? Because when I run the commands myself, it seems to work fine. Maybe some underlying race condition?
> Sorry that I am taking some time with this fix :( my local machine takes a very long time to run the regressions. If this is urgent, I will look for more powerful compute.
No worries, this isn't absolutely critical to the operation of the toolkit... (yet!)
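On the process-management hypothesis: one way to rule it in or out is to have the driver wait for each retrieval subprocess to exit, check its exit code, and confirm that the run file exists and is non-empty before evaluation starts. The sketch below illustrates that pattern with the standard `ProcessBuilder` API. `CheckedCommand` and `runAndVerify` are hypothetical names, not Anserini's actual harness code. Whether `SearchCollection` exits non-zero on an invalid option such as `-efSearch` is not confirmed in this thread, which is why the sketch checks the output file as well as the exit code.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper (not Anserini's harness code): runs a command and verifies it really produced output. */
final class CheckedCommand {

  /**
   * Runs the command to completion and returns true only if it exited with code 0
   * and the expected output file exists and is non-empty.
   */
  static boolean runAndVerify(List<String> command, Path expectedOutput, File log)
      throws IOException, InterruptedException {
    ProcessBuilder pb = new ProcessBuilder(command);
    // Merge stderr into stdout and append both to a log file, so the child never
    // blocks on a full pipe buffer and messages such as option errors are kept.
    pb.redirectErrorStream(true);
    pb.redirectOutput(ProcessBuilder.Redirect.appendTo(log));
    Process process = pb.start();
    // Block until the child has exited; only then can the output file be complete.
    int exitCode = process.waitFor();
    return exitCode == 0
        && Files.exists(expectedOutput)
        && Files.size(expectedOutput) > 0;
  }

  public static void main(String[] args) throws Exception {
    if (args.length < 2) {
      System.err.println("usage: CheckedCommand <expected-output-file> <command> [args...]");
      return;
    }
    Path output = Path.of(args[0]);
    List<String> command = List.of(args).subList(1, args.length);
    boolean ok = runAndVerify(command, output, new File("command.log"));
    System.out.println(ok ? "Run verified" : "Run failed or produced no output; see command.log");
  }
}
```

With a check like this, a retrieval step that prints an option error or is still writing its run file would be reported as a failure instead of "Run successfully completed!", which would separate driver-side problems from genuine scoring problems.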