ir_datasets issues

Why are Gov2 doc headers str but ClueWeb doc headers bytes?

As spotted by @searchivarius

seanmacavaney

bug

encoding fixes

1

WIP re: #151 (i.e., nothing should rely on the default encoding anywhere).

seanmacavaney
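
To illustrate the "don't rely on the default encoding" point from the entry above, here is a minimal sketch using only the Python standard library. The helper names (`write_text`, `read_text`, `decode_header`) and the choice of UTF-8 as the fallback are assumptions made for illustration; they are not functions from ir_datasets, and the actual fix may look different.

```python
# Hypothetical helpers (not part of ir_datasets): every read, write, and
# decode names its encoding explicitly instead of using the platform default.

def write_text(path, text, encoding="utf-8"):
    """Write text to path with an explicit encoding."""
    with open(path, "w", encoding=encoding) as f:
        f.write(text)


def read_text(path, encoding="utf-8"):
    """Read text from path with an explicit encoding."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


def decode_header(raw, encoding="utf-8"):
    """Turn raw header bytes into str; undecodable bytes become U+FFFD."""
    return raw.decode(encoding, errors="replace")
```

Passing `encoding=` on every call keeps the behaviour the same across platforms whose locale default differs, which is the concern the entry raises.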

DaReCzech

**Dataset Information:** A rather large dataset in Czech.
**Links to Resources:**
- Repo: https://github.com/Seznam/DaReCzech
- Paper: https://arxiv.org/pdf/2112.01810.pdf
**Dataset ID(s) & supported entities:**
- `dareczech` (docs)
- `dareczech/train` (docs, queries, qrels)...

seanmacavaney

add-dataset

min python version 3.7

2

Are we safe bumping the minimum Python version from 3.6 to 3.7? Python 3.6 reaches end of life in just a few weeks. Related: #139

seanmacavaney

py310

fixes #140

seanmacavaney

Qrel definitions for multiple fields

2

**Is your feature request related to a problem? Please describe.** When query-document pairs have multiple labels associated with them in their qrels, e.g., relevance and quality, only the relevance...

janheinrichmerker

enhancement
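
As a rough illustration of the request above, a qrel record with more than one label could be modelled as a named tuple. Everything here is hypothetical: `MultiLabelQrel`, its `quality` field, the whitespace-separated line layout, and `parse_qrel_line` are assumptions for the sketch, not the ir_datasets API or a format confirmed by the issue.

```python
from typing import NamedTuple


class MultiLabelQrel(NamedTuple):
    """Hypothetical qrel type carrying two labels per query-document pair."""
    query_id: str
    doc_id: str
    relevance: int
    quality: int


def parse_qrel_line(line):
    """Parse an assumed 'query_id doc_id relevance quality' line."""
    query_id, doc_id, relevance, quality = line.split()
    return MultiLabelQrel(query_id, doc_id, int(relevance), int(quality))
```

For example, `parse_qrel_line("q1 d7 2 1")` yields a record whose `relevance` and `quality` can be read independently, so neither label has to be dropped.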

cw12 and cw12/b13 verification and improved instructions

1

As reported by @searchivarius.
**Describe the bug** Right now, a user can end up with a faulty b13 subset if they only have the b13 disk and follow the instructions...

seanmacavaney

bug

Add dataset license info

1

Per discussion with @diegoceccarelli, it would be nice if the license information for each dataset was included in the documentation.

seanmacavaney

documentation

documentation for integrations

2

**Describe the proposed change** There's a growing number of integrations. Most recently Datamaestro (see #99)! We should document them, give a little promotion for each one, and provide instructions and/or...

seanmacavaney

documentation

beir suite

4

**Dataset Information:** [Beir](https://github.com/UKPLab/beir/blob/main/README.md) is a suite of benchmarks, intended to be used for testing zero-shot transfer. These would help extend the tool beyond primarily ad-hoc tasks. Their benchmarks perform their...

seanmacavaney

add-dataset
