ir_datasets

TREC NeuCLIR 2022

Open seanmacavaney opened this issue 3 years ago • 4 comments

Dataset Information:

"The main task for the proposed track is ad-hoc cross-language retrieval. Documents will be drawn from Common Crawl newswire, and will be written in Chinese, Russian, and Persian. Topics will be in English, and will be expressed in traditional TREC title/description/narrative form. Retrieved documents will be graded as highly relevant, somewhat relevant, and not relevant; we expect to use several metrics to evaluate runs, including nDCG@100 and ERR."
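For reference, a minimal sketch of how nDCG@100 could be computed over graded judgments like those described above. The gain values (2 for highly relevant, 1 for somewhat relevant, 0 for not relevant) and the linear-gain formulation are assumptions for illustration; the quoted description does not state the gain mapping the track uses, and the function names are hypothetical.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k gains (linear gain, log2 discount)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranking: List[str], qrels: Dict[str, int], k: int = 100) -> float:
    """nDCG@k for one topic.

    ranking: doc_ids in the order the system returned them.
    qrels:   doc_id -> graded relevance (assumed here: 2, 1, or 0).
    """
    # Unjudged documents are treated as not relevant.
    gains = [float(qrels.get(doc_id, 0)) for doc_id in ranking]
    # The ideal ranking places the highest-graded documents first.
    ideal = sorted((float(g) for g in qrels.values()), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


# Hypothetical judgments for a single topic.
example_qrels = {"d1": 2, "d2": 1, "d3": 0}
perfect = ndcg_at_k(["d1", "d2", "d3"], example_qrels)
swapped = ndcg_at_k(["d2", "d1", "d3"], example_qrels)
```

With these judgments, the perfectly ordered run scores 1.0 and the run that swaps the top two documents scores lower.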

Links to Resources:

  • https://neuclir.github.io/
  • Related: HC4

Dataset ID(s) & supported entities:

  • TBD

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

  • [ ] Dataset definition (in ir_datasets/datasets/[topid].py)
  • [ ] Tests (in tests/integration/[topid].py)
  • [ ] Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
  • [ ] Documentation (in ir_datasets/etc/[topid].yaml)
    • [ ] Documentation generated in https://github.com/seanmacavaney/ir-datasets.com/
  • [ ] Downloadable content (in ir_datasets/etc/downloads.json)
    • [ ] Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
    • [ ] Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.

Additional comments/concerns/ideas/etc.

cc: @eugene-yang

seanmacavaney avatar Feb 25 '22 22:02 seanmacavaney

As the structure of this collection will be very similar to HC4, should we just reuse HC4Doc defined in ./datasets/hc4.py?

eugene-yang avatar Mar 09 '22 23:03 eugene-yang

Yup, you can give the format a name and move them under ir_datasets/formats/
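A rough sketch of what a shared, named document format could look like once it is pulled out of the HC4 module. The class name `CommonCrawlNewsDoc`, the field list, and the JSON keys are all assumptions for illustration, not the actual HC4Doc definition; the real fields should be taken from ./datasets/hc4.py.

```python
import json
from datetime import datetime
from typing import NamedTuple, Optional


class CommonCrawlNewsDoc(NamedTuple):
    """Hypothetical shared document type; field names are assumed, not confirmed."""
    doc_id: str
    title: str
    text: str
    url: str
    time: Optional[datetime]
    cc_file: str


def parse_doc(line: str) -> CommonCrawlNewsDoc:
    """Parse one JSON-lines record into the shared type (JSON keys are assumed)."""
    record = json.loads(line)
    raw_time = record.get("time")
    return CommonCrawlNewsDoc(
        doc_id=record["id"],
        title=record.get("title", ""),
        text=record.get("text", ""),
        url=record.get("url", ""),
        time=datetime.fromisoformat(raw_time) if raw_time else None,
        cc_file=record.get("cc_file", ""),
    )


# Made-up record, only to show the shape of the data.
sample = json.dumps({
    "id": "doc-0001",
    "title": "Example title",
    "text": "Example body text.",
    "url": "https://example.com/article",
    "time": "2021-05-01T12:00:00",
    "cc_file": "example.warc.gz",
})
doc = parse_doc(sample)
```

Both dataset modules could then import the one type from a common location instead of each defining its own.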

seanmacavaney avatar Mar 10 '22 07:03 seanmacavaney

@seanmacavaney
For the directory and namespace structure, do you think neuclir/22 would be better (both neuclir and neuclir/22 are dummy levels)? This could accommodate future datasets/topics under the neuclir namespace.
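To make the proposal concrete, here is a small sketch of the layout. The leaf IDs (`neuclir/22/zh` and so on) are hypothetical, chosen only because the collection covers Chinese, Russian, and Persian; the helper simply lists which levels exist purely as parents of other IDs.

```python
from typing import Iterable, List

# Hypothetical leaf IDs; only "neuclir" and "neuclir/22" are named in this thread.
DATASET_IDS = ["neuclir/22/zh", "neuclir/22/ru", "neuclir/22/fa"]


def placeholder_levels(dataset_ids: Iterable[str]) -> List[str]:
    """Return the ID prefixes that are parents of other IDs but not datasets themselves."""
    ids = list(dataset_ids)
    leaves = set(ids)
    prefixes = set()
    for dataset_id in ids:
        parts = dataset_id.split("/")
        # Every proper prefix of an ID is a parent level in the namespace.
        for i in range(1, len(parts)):
            prefixes.add("/".join(parts[:i]))
    return sorted(prefixes - leaves)


current = placeholder_levels(DATASET_IDS)
# A later edition slots in beside the existing one without renaming anything.
future = placeholder_levels(DATASET_IDS + ["neuclir/23/zh"])
```

Under this layout a later year only adds a sibling level beneath `neuclir`, and the existing IDs stay unchanged.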

eugene-yang avatar Mar 11 '22 19:03 eugene-yang

I am revisiting this and hopefully adding NeuCLIR 22 topics and queries as well.

eugene-yang avatar Feb 18 '24 05:02 eugene-yang