ir_datasets
Provides a common interface to many IR ranking datasets.
This is something I was expecting to be quite straightforward (or at least better documented in the API) but it doesn't seem to be. Say I want to gather all...
Hi! Lately, I tried using the HTML extractor wrapper for clueweb documents. When directly wrapping the corpus docstore, everything works fine but when composing wrappers (dataset -> html extractor...
**Dataset Information:** To appear at SIGIR 2022. Is there a more succinct name for this dataset? **Links to Resources:** - [Paper](https://arxiv.org/pdf/2205.11685.pdf) - [Repo](https://github.com/SIGIR-2022/A-Dataset-for-Sentence-Retrieval-for-Open-Ended-Dialogues) **Dataset ID(s) & supported entities:** TBD **Checklist**...
**Describe the bug** I've stumbled on this before, and it seems like the same issue happens here. `.z` and `.Z` files are not always equivalent, but `TrecDocs` treats them like...
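The excerpt above is truncated, so what follows is only a sketch of one general way to tell such files apart: inspect the leading magic bytes instead of trusting the extension. It is not how `TrecDocs` is implemented. The helper name `sniff_compression` and the `MAGIC_BYTES` table are hypothetical and introduced purely for illustration.

```python
import gzip
import tempfile
from pathlib import Path

# Leading two bytes of some common Unix compression formats.
MAGIC_BYTES = {
    b"\x1f\x8b": "gzip",
    b"\x1f\x9d": "compress",  # LZW output of compress(1), conventionally `.Z`
    b"\x1f\x1e": "pack",      # Huffman output of pack(1), conventionally `.z`
}


def sniff_compression(path):
    """Return the format implied by the file's first two bytes, or None.

    Hypothetical helper: the decision is based on content, so the case of
    the file extension plays no role.
    """
    with open(path, "rb") as f:
        head = f.read(2)
    return MAGIC_BYTES.get(head)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # A gzip stream stored under a `.z` name is still reported as gzip.
        sample = Path(tmp) / "sample.z"
        with gzip.open(sample, "wb") as f:
            f.write(b"<DOC></DOC>")
        print(sample.name, sniff_compression(sample))
```

A reader could dispatch to the matching decompressor based on the returned name and fall back to extension-based handling when the result is `None`.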
Regarding #189: took the opportunity to improve the tests here for the variety of formats, etc. that TrecDocs may encounter. @ArthurCamara -- mind running `python -m tests.integration.disks45` when using your...
**Dataset Information:** A Chinese question answering dataset. **Links to Resources:** - Repo: https://github.com/baidu/DuReader - Paper: https://arxiv.org/abs/2203.10232 **Dataset ID(s) & supported entities:** - TBD **Checklist** Mark each task once completed. All...
**Dataset Information:** "WANDS is a Wayfair product search relevance dataset." **Links to Resources:** - https://github.com/wayfair/WANDS - https://easychair.org/publications/preprint_download/j2D4 **Dataset ID(s) & supported entities:** - `wands` (docs, queries, qrels) **Checklist** Mark each...
Right now, the `ir-datasets.bib` file is a bit messy, with inconsistencies in the ids/fields/formatting/etc. across records. It's probably best to go with an established source, such as DBLP, the ACL...
**Dataset Information:** An Urdu test collection. **Links to Resources:** - https://arxiv.org/pdf/2011.00565.pdf **Dataset ID(s) & supported entities:** - `cure` **Checklist** Mark each task once completed. All should be checked prior to...
**Dataset Information:** "The main task for the proposed track is ad-hoc cross-language retrieval. Documents will be drawn from Common Crawl newswire, and will be written in Chinese, Russian, and Persian....