beir suite

Open seanmacavaney opened this issue 4 years ago • 4 comments

Dataset Information:

BEIR is a suite of benchmarks intended for testing zero-shot transfer.

Adding these would help extend ir_datasets beyond primarily ad-hoc tasks.

BEIR applies its own pre-processing to each benchmark. For like-for-like comparisons, we should use its pre-processed versions rather than aliasing the versions we already have where there's overlap. It should be easy to support the datasets that BEIR makes available as downloads.
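As a rough illustration of why the downloadable datasets should be easy to support, here is a minimal sketch of reading queries and qrels. It assumes the layout described in the BEIR README (a `queries.jsonl` file with `_id` and `text` keys, and a tab-separated qrels file with a `query-id`, `corpus-id`, `score` header). The `Query` and `Qrel` types and the `iter_queries` / `iter_qrels` helpers are hypothetical names used only for this example; they are not part of ir_datasets or BEIR.

```python
import csv
import io
import json
from typing import Iterable, Iterator, NamedTuple


class Query(NamedTuple):  # hypothetical type, for illustration only
    query_id: str
    text: str


class Qrel(NamedTuple):  # hypothetical type, for illustration only
    query_id: str
    doc_id: str
    relevance: int


def iter_queries(lines: Iterable[str]) -> Iterator[Query]:
    """Yield one Query per non-empty JSONL line (assumed keys: "_id", "text")."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield Query(query_id=str(record["_id"]), text=record["text"])


def iter_qrels(lines: Iterable[str]) -> Iterator[Qrel]:
    """Yield one Qrel per TSV row (assumed header: query-id, corpus-id, score)."""
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        yield Qrel(row["query-id"], row["corpus-id"], int(row["score"]))


# In-memory stand-ins for the downloaded files, so the sketch runs offline.
sample_queries = io.StringIO('{"_id": "q1", "text": "what is zero-shot transfer?"}\n')
sample_qrels = io.StringIO("query-id\tcorpus-id\tscore\nq1\td7\t1\n")

queries = list(iter_queries(sample_queries))
qrels = list(iter_qrels(sample_qrels))
```

Because the text is read exactly as BEIR ships it, nothing is re-tokenized or re-filtered on our side, which is the point of not aliasing to the existing versions.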

Links to Resources:

  • https://github.com/UKPLab/beir/blob/main/README.md
  • https://arxiv.org/abs/2104.08663

Dataset ID(s):

  • beir (empty placeholder)
  • beir/msmarco (docs, queries, qrels) --- the MS MARCO passage collection, dev subset; should correspond with msmarco-passage/dev, but there may be differences in pre-processing
  • beir/trec-covid (docs, queries, qrels) --- the TREC COVID complete benchmark; should correspond with trec-covid, but there may be differences in pre-processing. Also, not all metadata is available in their experimental setting, and it only uses the natural language questions
  • beir/nfcorpus (docs, queries, qrels) --- NFCorpus, unclear which is the corresponding irds ID, but presumably some filtered portion of the test set?
  • ~beir/bioasq (docs, queries, qrels)~ not available for download
  • beir/nq (docs, queries, qrels) --- another version of the Natural Questions dev dataset; pre-processing differs from natural-questions/dev and dpr-w100/natural-questions/dev because both the document selection and the query filtering are different
  • beir/hotpot (docs, queries, qrels) --- HotpotQA
  • beir/fiqa (docs, queries, qrels) --- FiQA-2018
  • ~beir/signal1m (docs, queries, qrels)~ not available for download --- Signal-1M(RT)
  • ~beir/trec-news (docs, queries, qrels)~ not available for download --- TREC Background Linking
  • beir/arguana (docs, queries, qrels) --- ArguAna Counterargument retrieval
  • beir/webis-touche2020 (docs, queries, qrels) --- Touche-2020 conversational arguments
  • beir/cqadupstack (docs, queries, qrels) --- CQADupstack community question answering
  • beir/quora (docs, queries, qrels) --- Quora duplicate question identification
  • beir/dbpedia-entity (docs, queries, qrels) --- DBPedia entity linking
  • beir/scidocs (docs, queries, qrels) --- SCIDOCS citation prediction
  • beir/fever (docs, queries, qrels) --- FEVER fact verification
  • beir/climate-fever (docs, queries, qrels) --- Climate-FEVER fact verification on climate topics
  • beir/scifact (docs, queries, qrels) --- SciFact fact verification from scientific literature

Supported Entities

  • [x] docs
  • [x] queries
  • [x] qrels
  • [ ] scoreddocs
  • [ ] docpairs

Additional comments/concerns/ideas/etc.

We need to be sure to include both the original dataset citation and the citation to BEIR in the dataset documentation.

Could having several versions of the same dataset cause confusion? The documentation should provide information to disambiguate.

seanmacavaney • Apr 23 '21 13:04

A downside of adding these is that they only consist of test components. If folks wanted to train on some of these (at least, the ones that have training data), they'd be out of luck until somebody gets around to adding the full version of the datasets.

seanmacavaney • Apr 23 '21 13:04

I stand corrected on the topic of metadata & other fields. Some metadata is provided for queries and docs under the metadata key.

seanmacavaney • Apr 23 '21 15:04

@searchivarius reminds me that the BEIR doc objects ought to be better structured, especially RE the metadata.

Most are either (doc_id, text) or (doc_id, title, text). A few have (undocumented?) metadata as a dictionary, but we should be able to structure these properly in a custom namedtuple for each such corpus.
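A minimal sketch of what such a per-corpus namedtuple could look like. It assumes corpus records are JSON objects with `_id`, `title`, `text`, and an optional `metadata` dictionary. The type names (`BeirDoc`, `BeirTitleDoc`, `ExampleMetaDoc`), the `parse_meta_doc` helper, and the `url` / `pubmed_id` metadata keys are all hypothetical and chosen only for illustration; the real keys would need to be checked per corpus.

```python
import json
from typing import NamedTuple, Optional


class BeirDoc(NamedTuple):  # hypothetical: corpora with only (doc_id, text)
    doc_id: str
    text: str


class BeirTitleDoc(NamedTuple):  # hypothetical: corpora with (doc_id, title, text)
    doc_id: str
    title: str
    text: str


class ExampleMetaDoc(NamedTuple):  # hypothetical: one corpus with extra metadata
    doc_id: str
    title: str
    text: str
    url: Optional[str]        # assumed metadata key, for illustration
    pubmed_id: Optional[str]  # assumed metadata key, for illustration


def parse_meta_doc(line: str) -> ExampleMetaDoc:
    """Flatten the nested metadata dict of one JSONL record into named fields."""
    record = json.loads(line)
    metadata = record.get("metadata") or {}  # tolerate a missing or null key
    return ExampleMetaDoc(
        doc_id=str(record["_id"]),
        title=record.get("title", ""),
        text=record["text"],
        url=metadata.get("url"),
        pubmed_id=metadata.get("pubmed_id"),
    )


line = (
    '{"_id": "d1", "title": "A title", "text": "Body text.", '
    '"metadata": {"url": "https://example.org/d1"}}'
)
doc = parse_meta_doc(line)
```

Missing metadata keys come back as `None` rather than raising, so a corpus whose metadata is only partially populated still yields a fixed set of fields.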

seanmacavaney • Sep 21 '21 16:09

Thank you. I am currently fine with it, but if you add more structure in the future, that would be great.

searchivarius • Sep 21 '21 16:09