ir_datasets
DaReCzech
Dataset Information:
A rather large dataset in Czech.
Links to Resources:
- Repo: https://github.com/Seznam/DaReCzech
- Paper: https://arxiv.org/pdf/2112.01810.pdf
Dataset ID(s) & supported entities:
- `dareczech` (docs)
- `dareczech/train` (docs, queries, qrels)
- `dareczech/train/small` (docs, queries, qrels)
- `dareczech/dev` (docs, queries, qrels)
- `dareczech/test` (docs, queries, qrels)
It appears to be a re-ranking dataset, so scoreddocs will likely be provided as well.
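To make the proposed layout concrete, the dataset IDs and entity types listed above can be written down as a plain mapping. This is only a sketch of the plan, not the eventual ir_datasets definition: `PROPOSED_ENTITIES` and `with_scoreddocs` are hypothetical names, and attaching scoreddocs to every subset that has qrels is an assumption, since it is not yet confirmed that they will be provided.

```python
# Proposed dataset IDs and their entity types, copied from the list above.
PROPOSED_ENTITIES = {
    "dareczech": ["docs"],
    "dareczech/train": ["docs", "queries", "qrels"],
    "dareczech/train/small": ["docs", "queries", "qrels"],
    "dareczech/dev": ["docs", "queries", "qrels"],
    "dareczech/test": ["docs", "queries", "qrels"],
}


def with_scoreddocs(entities):
    """Return a new mapping in which every subset with qrels also lists scoreddocs.

    Assumption: scoreddocs would accompany each subset that has qrels.
    The input mapping is left unmodified.
    """
    return {
        dataset_id: (types + ["scoreddocs"]) if "qrels" in types else list(types)
        for dataset_id, types in entities.items()
    }


if __name__ == "__main__":
    for dataset_id, types in with_scoreddocs(PROPOSED_ENTITIES).items():
        print(f"{dataset_id} ({', '.join(types)})")
```

The top-level `dareczech` ID stays docs-only in this sketch because it has no queries to re-rank.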
Checklist
Mark each task once completed. All should be checked prior to merging a new dataset.
- [ ] Dataset definition (in `ir_datasets/datasets/[topid].py`)
- [ ] Tests (in `tests/integration/[topid].py`)
- [ ] Metadata generated (using `ir_datasets generate_metadata` command, should appear in `ir_datasets/etc/metadata.json`)
- [ ] Documentation (in `ir_datasets/etc/[topid].yaml`)
  - [ ] Documentation generated in https://github.com/seanmacavaney/ir-datasets.com/
- [ ] Downloadable content (in `ir_datasets/etc/downloads.json`)
  - [ ] Download verification action (in `.github/workflows/verify_downloads.yml`). Only one needed per `topid`.
  - [ ] Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in `downloads.json`.
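
The file paths named in the checklist can be checked mechanically from a repository checkout. The following is a minimal sketch, assuming it is run from the repository root; `expected_files` and `checklist_status` are hypothetical helper names, not part of ir_datasets. It only tests that each file exists. Shared files such as `metadata.json` and `downloads.json` already exist for other datasets, so their presence does not show that an entry for this `topid` was added.

```python
from pathlib import Path


def expected_files(topid):
    """Relative paths named in the checklist, with [topid] filled in."""
    return [
        f"ir_datasets/datasets/{topid}.py",
        f"tests/integration/{topid}.py",
        "ir_datasets/etc/metadata.json",
        f"ir_datasets/etc/{topid}.yaml",
        "ir_datasets/etc/downloads.json",
        ".github/workflows/verify_downloads.yml",
    ]


def checklist_status(repo_root, topid):
    """Map each expected path to True if it exists as a file under repo_root."""
    root = Path(repo_root)
    return {rel: (root / rel).is_file() for rel in expected_files(topid)}


if __name__ == "__main__":
    # Print a checkbox-style summary for the proposed top-level ID.
    for rel, present in checklist_status(".", "dareczech").items():
        print(f"[{'x' if present else ' '}] {rel}")
```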
Additional comments/concerns/ideas/etc.
The dataset is available only on request and after accepting a disclaimer, so it will be another semi-manual dataset, with instructions provided for obtaining access.
I've requested access.
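
For a semi-manual dataset, the usual pattern is to look for a file the user obtained themselves and, if it is missing, fail with instructions on how to get it. The sketch below illustrates that pattern with the standard library only; it is not the mechanism ir_datasets itself uses. `require_manual_file` and the `example.tsv` file name are hypothetical, and the assumption that the data directory is `~/.ir_datasets` (overridable through the `IR_DATASETS_HOME` environment variable) should be checked against the library before relying on it.

```python
import os
from pathlib import Path


def irds_home():
    """Assumed data directory: $IR_DATASETS_HOME if set, else ~/.ir_datasets."""
    return Path(os.environ.get("IR_DATASETS_HOME", Path.home() / ".ir_datasets"))


def require_manual_file(filename, instructions, home=None):
    """Return the path of a manually obtained file, or raise with instructions.

    The file is looked up under <home>/dareczech/; this layout is an assumption.
    """
    base = Path(home) if home is not None else irds_home()
    path = base / "dareczech" / filename
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found.\n{instructions}")
    return path


if __name__ == "__main__":
    try:
        # "example.tsv" is a placeholder, not an actual DaReCzech file name.
        print(require_manual_file(
            "example.tsv",
            "Request access via https://github.com/Seznam/DaReCzech, accept the "
            "disclaimer, then copy the file to the path shown above.",
        ))
    except FileNotFoundError as err:
        print(err)
```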