ir_datasets issues

Direct access to all doc_ids

5

This is something I was expecting to be quite straightforward (or at least better documented in the API) but it doesn't seem to be. Say I want to gather all...

ArthurCamara
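
As a rough illustration of what gathering all doc_ids could look like, here is a minimal sketch that walks any document iterator and yields only the IDs. The `Doc` record and `iter_doc_ids` helper are hypothetical names made up for this example. The only assumption carried over from ir_datasets is that documents expose a `doc_id` field.

```python
from typing import Iterable, Iterator, NamedTuple


class Doc(NamedTuple):
    # Hypothetical stand-in for a document record; only doc_id matters here.
    doc_id: str
    text: str


def iter_doc_ids(docs: Iterable) -> Iterator[str]:
    """Yield the doc_id of every document, without keeping the documents."""
    for doc in docs:
        yield doc.doc_id


# Small in-memory sample so the sketch runs without downloading a corpus.
sample = [Doc("d1", "first document"), Doc("d2", "second document")]
doc_ids = list(iter_doc_ids(sample))
```

With ir_datasets installed, the same helper should accept `ir_datasets.load("vaswani").docs_iter()` in place of `sample`. That still reads every document, which may be the overhead the issue is asking to avoid.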

Fix and add html extractor

Hi! Lately, I tried using the HTML extractor wrapper for ClueWeb documents. When directly wrapping the corpus docstore, everything works fine, but when composing wrappers (dataset -> html extractor...

grodino

A Dataset for Sentence Retrieval for Open-Ended Dialogues

**Dataset Information:** To appear at SIGIR 2022. Is there a more succinct name for this dataset?
**Links to Resources:**
- [Paper](https://arxiv.org/pdf/2205.11685.pdf)
- [Repo](https://github.com/SIGIR-2022/A-Dataset-for-Sentence-Retrieval-for-Open-Ended-Dialogues)
**Dataset ID(s) & supported entities:** TBD
**Checklist**...

seanmacavaney

add-dataset

TrecDocs: .Z and .z files are different.

7

**Describe the bug** I've stumbled on this before, and it seems like the same issue happens here. `.z` and `.Z` files are not always equivalent, but `TrecDocs` treats them like...

ArthurCamara

bug
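
One way to avoid relying on the file extension is to look at the first two bytes of the file. This is a minimal sketch, not how `TrecDocs` is implemented: `sniff_compression` and the constant names are hypothetical. The magic numbers themselves are standard: gzip streams start with `1f 8b`, and Unix `compress` (LZW) streams start with `1f 9d`.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"      # gzip header
COMPRESS_MAGIC = b"\x1f\x9d"  # Unix compress (LZW) header


def sniff_compression(header: bytes) -> str:
    """Classify a file by its leading bytes instead of its extension."""
    if header.startswith(GZIP_MAGIC):
        return "gzip"
    if header.startswith(COMPRESS_MAGIC):
        return "compress"
    return "unknown"


# A gzip payload is detected regardless of whether it is named .z or .Z.
gz_kind = sniff_compression(gzip.compress(b"<DOC></DOC>"))
```

In practice the header would come from something like `open(path, "rb").read(2)`, and the result would decide which decompressor to use.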

handling .z files as gzip

3

Regarding #189: took the opportunity to improve the tests here for the variety of formats, etc. that TrecDocs may encounter. @ArthurCamara -- mind running `python -m tests.integration.disks45` when using your...

seanmacavaney

DuReader

**Dataset Information:** A Chinese question answering dataset.
**Links to Resources:**
- Repo: https://github.com/baidu/DuReader
- Paper: https://arxiv.org/abs/2203.10232
**Dataset ID(s) & supported entities:**
- TBD
**Checklist** Mark each task once completed. All...

seanmacavaney

add-dataset

WANDS

**Dataset Information:** "WANDS is a Wayfair product search relevance dataset."
**Links to Resources:**
- https://github.com/wayfair/WANDS
- https://easychair.org/publications/preprint_download/j2D4
**Dataset ID(s) & supported entities:**
- `wands` (docs, queries, qrels)
**Checklist** Mark each...

seanmacavaney

add-dataset

Use bibtex from [dblp, acl anthology, ir anthology, acm dl, elsewhere?]

Right now, the `ir-datasets.bib` file is a bit messy, with inconsistencies in the ids/fields/formatting/etc. across records. It's probably best to go with an established source, such as DBLP, the ACL...

seanmacavaney

documentation

CURE

**Dataset Information:** An Urdu test collection.
**Links to Resources:**
- https://arxiv.org/pdf/2011.00565.pdf
**Dataset ID(s) & supported entities:**
- `cure`
**Checklist** Mark each task once completed. All should be checked prior to...

seanmacavaney

add-dataset

TREC NeuCLIR 2022

4

**Dataset Information:** "The main task for the proposed track is ad-hoc cross-language retrieval. Documents will be drawn from Common Crawl newswire, and will be written in Chinese, Russian, and Persian....

seanmacavaney

add-dataset
