ir_datasets
TREC CAsT
For conversational AI. http://www.treccast.ai/
Documents: Uses MS-MARCO, TREC CAR, and Washington Post collections.
It also includes lists of duplicate documents, presumably a consequence of combining collections.
Queries/qrels: Queries come in sequence: 30 training for Y1, 50 testing for Y1, 50 testing for Y2. The sequence can be encoded as additional fields on the query (a sketch follows these notes).
Query text includes #combine() syntax.
Looks like it uses TREC CAR v2, so depends on #5
The task is happening again in 2021: https://trec.nist.gov/pubs/call2021.html
wapo collection added for #51
Related to #80
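Since the conversational sequence matters here, the natural way to expose it is via extra query fields. A minimal sketch, assuming an illustrative schema (these field names are hypothetical, not the final ir_datasets query type):

from typing import NamedTuple

class CastQuery(NamedTuple):
    # Illustrative fields only; the real ir_datasets query type may differ.
    query_id: str        # e.g. "31_1": topic number + turn number
    raw_utterance: str   # query text, which may contain #combine() syntax
    topic_number: int    # identifies the conversation the turn belongs to
    turn_number: int     # position of this query within the conversation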
Proposed structure:
trec-cast # placeholder
trec-cast/2019 # corpus: MSMARCOv1 + CARv2 + WaPo v2 (split by paragraph) (MSMARCO & WaPo deduped per provided files)
trec-cast/2019/train # limited set of training topics provided
trec-cast/2019/train/judged # limited set of training topics provided, filtered down to only judged ones
trec-cast/2019/eval
trec-cast/2020 # corpus: MSMARCOv1 + CARv2 (any dedup??)
trec-cast/2021 # corpus: MSMARCOv1 + WaPo2020 + KILT (dedup)
So, to get this going, we need to:
- Add KILT (#80)
- Add CARv2 (#5)
- Add WaPo 2020 (#43)
Then have a component that merges and dedupes the corpus (per the dedup files).
After this, the topics and qrels should be easy.
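A rough sketch of that merge/dedup component follows. The duplicate-file line format assumed here ("canonical_id:dup_id1,dup_id2,...") and the iterator names are assumptions; adjust the parsing to whatever the organizers actually distribute.

def load_duplicates(dup_file):
    # Parse a duplicates file into a set of doc_ids to drop.
    # Assumed line format: "canonical_id:dup_id1,dup_id2,..."
    duplicates = set()
    with open(dup_file) as fh:
        for line in fh:
            _, _, dups = line.strip().partition(':')
            duplicates.update(d for d in dups.split(',') if d)
    return duplicates

def merged_docs_iter(collection_iters, dup_files):
    # Yield docs from all source collections, skipping listed duplicates.
    duplicates = set()
    for dup_file in dup_files:
        duplicates |= load_duplicates(dup_file)
    for docs_iter in collection_iters:
        for doc in docs_iter:
            if doc.doc_id not in duplicates:
                yield doc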
Progress made on this branch.
Noticed that WaPo v2 wasn't used for evaluation in 2019, so it should be removed. The tricky bit is that 2019/train and 2019/train/judged do use WaPo v2, but 2019/eval does not. What to do... Give the different corpora v1, v2, ... names, as was done for PMC?
trec-cast # placeholder
trec-cast/v0 # corpus: MSMARCOv1 + CARv2 + WaPo v2 (split by paragraph) (MSMARCO & WaPo deduped per provided files)
trec-cast/v0/train # limited set of training topics provided
trec-cast/v0/train/judged # limited set of training topics provided, filtered down to only judged ones
trec-cast/v1 # corpus: MSMARCOv1 + CARv2
trec-cast/v1/2019
trec-cast/v1/2020
trec-cast/v2 # corpus: MSMARCOv1 + WaPo2020 + KILT (dedup)
trec-cast/v2/2021
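Under that layout, usage would follow the normal ir_datasets pattern. The dataset IDs below are the proposed ones and are not registered yet, and the query/qrel field names are illustrative:

import ir_datasets

dataset = ir_datasets.load('trec-cast/v1/2019')  # proposed ID, not yet registered
for query in dataset.queries_iter():
    print(query.query_id, query.text)            # field names illustrative
for qrel in dataset.qrels_iter():
    print(qrel.query_id, qrel.doc_id, qrel.relevance)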
Getting closer to adding CAsT 2021 with the addition of KILT in #161
Started working again on the integration of CAsT into ir_datasets.
The dependence on spaCy to reproduce the splitting done in CAsT is overly complex in the current branch. I started working on an alternate solution that uses the official splits to match the original documents. This would require storing the offset files (along with hashes, to be on the safe side) on a server, but the advantage would be getting rid of the dependence on spaCy.
The offset file looks like this
...
{"id": "KILT_20189", "ranges": [[[0, 1338]], [[1341, 2437]], [[2440, 3682]], [[3685, 5023]], [[5026, 6439]], [[6442, 7670]], [[7672, 8444]], [[8447, 10270]], [[7672, 8444], [10273, 10794]], [[10796, 12094]], [[12096, 13437]], [[13440, 14750]], [[14752, 15808]], [[15810, 17226]], [[17228, 18461]], [[18465, 19862]], [[19865, 21125]], [[21127, 22422]], [[22424, 23794]], [[23796, 25072]], [[25074, 26118]], [[26120, 27370]], [[27372, 28645]], [[28647, 29884]], [[29886, 31213]], [[31215, 32461]], [[32463, 33458]], [[33460, 34733]], [[34735, 36013]], [[36015, 37100]], [[37102, 38434]], [[38437, 39293]], [[39296, 40513]], [[40516, 41636]], [[41639, 42761]], [[42764, 43634]]], "md5": "06058b7a8193d0cd9f1d5139abf36263"}
...
that specifies the offsets of the different passages composing the document KILT_20189.
When processing with ir_datasets, using the offset file along with the original files makes it possible to recover the CAsT splitting (the spaces introduced by spaCy are lost, but I don't think this is a big deal; if anything, it's an improvement).
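To make the idea concrete, here is a minimal sketch of how the offset file could be consumed. It assumes 0-based, end-exclusive character offsets, an md5 computed over the source text, and a single space joining multi-range passages; none of these conventions is fixed by the snippet above, so treat them as assumptions, and load_source_text is a placeholder.

import hashlib
import json

def split_into_passages(source_text, entry):
    # `entry` is one JSON line of the offset file shown above.
    # Assumptions: offsets are 0-based, end-exclusive character positions
    # into `source_text`, and the md5 covers the source text.
    if hashlib.md5(source_text.encode('utf-8')).hexdigest() != entry['md5']:
        raise ValueError('checksum mismatch for ' + entry['id'])
    passages = []
    for ranges in entry['ranges']:
        # A passage may be assembled from several character ranges.
        passages.append(' '.join(source_text[s:e] for s, e in ranges))
    return passages

# Usage sketch: pair each offset entry with its original document.
# with open('kilt_offsets.jsonl') as fh:
#     for line in fh:
#         entry = json.loads(line)
#         passages = split_into_passages(load_source_text(entry['id']), entry)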
Awesome, I like this approach a lot. It seems like a perfect compromise that allows the files to be downloaded from the original source (or obtained through the proper channels, in the case of WaPo) while also avoiding the complexity of the spaCy dependence. Bravo!
OK, so I will continue in this direction. Is there a storage location for ir_datasets related files?
There are several options. How big do you expect the offset files to be? They might be able to fit on mirror.ir-datasets.com (hosted via a github site): https://github.com/seanmacavaney/irds-mirror/
If not, they could probably go up on huggingface.
For the offset files, the total will be around 1.4 GB (227 MB x 2 for KILT 2021 and 2022, 183 MB for MS MARCO v1, 550 MB for MS MARCO v2, 41 MB for WaPo).
What should I do with the Python script that generates them (could be useful for reference)?
I also started a pull request (https://github.com/allenai/ir_datasets/pull/255) and am waiting for your comments.