
TREC CAST

Open seanmacavaney opened this issue 5 years ago • 12 comments

For conversational AI. http://www.treccast.ai/

Documents: Uses MS-MARCO, TREC CAR, and Washington Post collections.

Also includes a list of duplicate documents, apparently resulting from the combination of collections.

Queries/qrels: Queries come in conversational sequences. 30 training topics for Y1, 50 test topics for Y1, 50 test topics for Y2.

The position in the sequence can be encoded as another field in the query.
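As a sketch (the type and field names here are hypothetical, not the final ir_datasets type), the turn sequence could be carried as explicit fields on the query:

```python
from typing import NamedTuple

# Hypothetical query type: carries the conversational position as
# explicit fields alongside the utterance. Names are illustrative only.
class CastQuery(NamedTuple):
    query_id: str       # e.g. "31_1" for topic 31, turn 1
    topic_number: int
    turn_number: int
    raw_utterance: str

q = CastQuery("31_1", 31, 1, "What is a physician's assistant?")
```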

Query text includes #combine() syntax.
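For instance (a hypothetical helper; the exact handling would depend on how the raw topic files express the operator), the Indri-style `#combine()` wrapper could be stripped to recover plain query text:

```python
import re

# Hypothetical helper: some raw CAsT query strings are wrapped in
# Indri's #combine(...) operator; unwrap it, otherwise pass through.
def strip_combine(query: str) -> str:
    m = re.fullmatch(r"#combine\((.*)\)", query.strip(), flags=re.DOTALL)
    return m.group(1).strip() if m else query.strip()
```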

seanmacavaney avatar Nov 08 '20 18:11 seanmacavaney

Looks like it uses TREC CAR v2, so depends on #5

seanmacavaney avatar Nov 08 '20 19:11 seanmacavaney

The task is happening again in 2021: https://trec.nist.gov/pubs/call2021.html

seanmacavaney avatar Mar 03 '21 12:03 seanmacavaney

wapo collection added for #51

seanmacavaney avatar Mar 31 '21 11:03 seanmacavaney

Related to #80

seanmacavaney avatar Jun 24 '21 13:06 seanmacavaney

Proposed structure:

trec-cast # placeholder
trec-cast/2019 # corpus: MSMARCOv1 + CARv2 + WaPo v2 (split by paragraph)  (MSMARCO & WaPo deduped per provided files)
trec-cast/2019/train # limited set of training topics provided
trec-cast/2019/train/judged # limited set of training topics provided, filtered down to only judged ones
trec-cast/2019/eval
trec-cast/2020 # corpus: MSMARCOv1 + CARv2  (any dedup??)
trec-cast/2021 # corpus: MSMARCOv1 + WaPo2020 + KILT  (dedup)

So, to get this going, we need to:

  • Add KILT (#80)
  • Add CARv2 (#5)
  • Add WaPo2020 (#43)

Then have a component that merges and dedupes the corpus (per the dedup files).
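A minimal sketch of such a merge-and-dedup component (function names and the dedup-file format are assumptions; the actual provided files may list duplicate pairs rather than bare IDs):

```python
def load_dedup_ids(path):
    """Read one document ID per line from an organizer-provided dedup file."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def merged_corpus(corpora, dup_ids):
    """Chain several (doc_id, text) iterators, skipping duplicate IDs."""
    seen = set()
    for corpus in corpora:
        for doc_id, text in corpus:
            if doc_id in dup_ids or doc_id in seen:
                continue
            seen.add(doc_id)
            yield doc_id, text
```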

After this, the topics and qrels should be easy.

seanmacavaney avatar Dec 12 '21 12:12 seanmacavaney

Progress made on this branch.

Noticed that WaPo v2 wasn't used for evaluation in 2019, so it should be removed. The tricky bit now is that 2019/train and 2019/train/judged do use WaPo v2, but 2019/eval does not. What to do... give the different corpora v1, v2, ... names, as was done for PMC?

trec-cast # placeholder
trec-cast/v0 # corpus: MSMARCOv1 + CARv2 + WaPo v2 (split by paragraph)  (MSMARCO & WaPo deduped per provided files)
trec-cast/v0/train # limited set of training topics provided
trec-cast/v0/train/judged # limited set of training topics provided, filtered down to only judged ones
trec-cast/v1 # corpus: MSMARCOv1 + CARv2 
trec-cast/v1/2019
trec-cast/v1/2020
trec-cast/v2 # corpus: MSMARCOv1 + WaPo2020 + KILT  (dedup)
trec-cast/v2/2021

seanmacavaney avatar Jan 04 '22 13:01 seanmacavaney

Getting closer to adding CAsT 2021 with the addition of KILT in #161

seanmacavaney avatar Feb 25 '22 21:02 seanmacavaney

Started working again on the integration of CAsT into ir_datasets.

The current branch's dependence on spaCy to reproduce the splitting done in CAsT is overly complex. I started working on an alternative solution that uses the official splits to match the original documents. This would require hosting the offset files (along with hashes, to be on the safe side) on a server, but the advantage would be to get rid of the spaCy dependence.

The offset file looks like this:

...
{"id": "KILT_20189", "ranges": [[[0, 1338]], [[1341, 2437]], [[2440, 3682]], [[3685, 5023]], [[5026, 6439]], [[6442, 7670]], [[7672, 8444]], [[8447, 10270]], [[7672, 8444], [10273, 10794]], [[10796, 12094]], [[12096, 13437]], [[13440, 14750]], [[14752, 15808]], [[15810, 17226]], [[17228, 18461]], [[18465, 19862]], [[19865, 21125]], [[21127, 22422]], [[22424, 23794]], [[23796, 25072]], [[25074, 26118]], [[26120, 27370]], [[27372, 28645]], [[28647, 29884]], [[29886, 31213]], [[31215, 32461]], [[32463, 33458]], [[33460, 34733]], [[34735, 36013]], [[36015, 37100]], [[37102, 38434]], [[38437, 39293]], [[39296, 40513]], [[40516, 41636]], [[41639, 42761]], [[42764, 43634]]], "md5": "06058b7a8193d0cd9f1d5139abf36263"}
...

which specifies the offsets of the passages composing the document KILT_20189.

When processing with ir_datasets, using the offset file along with the original files makes it possible to recover the CAsT splitting (the whitespace introduced by spaCy is lost, but I don't think this is a big deal; on the contrary).
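A sketch of the reconstruction step (the join separator is an assumption, since the exact spacing introduced by spaCy is lost):

```python
def extract_passages(doc_text, ranges):
    # Each passage is a list of [start, end) character ranges; a single
    # passage may stitch together several non-contiguous ranges, as in
    # the KILT_20189 record above.
    return [" ".join(doc_text[s:e] for s, e in passage) for passage in ranges]
```

The per-record md5 could additionally be checked against the source document, so that silent changes to the original files are detected rather than producing wrong passages.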

bpiwowar avatar Jan 25 '24 21:01 bpiwowar

Awesome, I like this approach a lot. It seems like a perfect compromise that allows the files to be downloaded from the original source (or obtained through the proper channels, in the case of WaPo) while also avoiding the complexity of the spaCy dependence. Bravo!

seanmacavaney avatar Jan 25 '24 22:01 seanmacavaney

OK, so I will continue in this direction. Is there a storage location for ir_datasets related files?

bpiwowar avatar Jan 26 '24 06:01 bpiwowar

There are several options. How big do you expect the offset files to be? They might be able to fit on mirror.ir-datasets.com (hosted via a github site): https://github.com/seanmacavaney/irds-mirror/

If not, they could probably go up on huggingface.

seanmacavaney avatar Jan 26 '24 09:01 seanmacavaney

For the offset files, the total will be around 1.4GB (227M x 2 for KILT 2021 and 2022, 183M for MS Marco V1, 550M for MS Marco V2, 41M for WAPO).

What should I do with the Python script that generates them (could be useful for reference)?

I also started a pull request https://github.com/allenai/ir_datasets/pull/255 and am waiting for your comments.

bpiwowar avatar Jan 27 '24 09:01 bpiwowar