ir_datasets
LongEval Retrieval (used at CLEF 2023)
Dataset Information:
The goal is to integrate the LongEval data for Task 1 on retrieval.
The information from the official task description:
The goal of Task 1 is to propose an information retrieval system that can handle changes over time. The proposed retrieval system should follow the temporal evolution of Web documents. The LongEval Websearch collection relies on a large set of data (a corpus of pages, queries, and user interactions) provided by a commercial search engine (Qwant). It is designed to reflect changes of the Web across time by providing evolving document and query sets. The queries in the collection were collected from Qwant's users over several months and can thus be expected to reflect changes in the users' search preferences. The documents in the collection were then selected so that retrieval on these queries can be evaluated well at the time they were collected, and thus also change over time.
Links to Resources:
https://clef-longeval.github.io/
Dataset ID(s) & supported entities:
- `longeval/en/train`: docs, queries, qrels
- `longeval/en/heldout`: docs, queries
- `longeval/en/a-short-july`: docs, queries
- `longeval/en/b-long-september`: docs, queries
- `longeval/fr/train`: docs, queries, qrels
- `longeval/fr/heldout`: docs, queries
- `longeval/fr/a-short-july`: docs, queries
- `longeval/fr/b-long-september`: docs, queries
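For reference, once registered these IDs would be used through the standard ir_datasets Python API. A minimal sketch (the IDs are the ones proposed above and will only resolve after the integration is merged; exact doc/query field names depend on the final implementation):

```python
import ir_datasets

# Load the English training split (ID as proposed in this issue).
dataset = ir_datasets.load("longeval/en/train")

# Standard ir_datasets iterators over the supported entities.
for doc in dataset.docs_iter():
    print(doc.doc_id)
    break

for query in dataset.queries_iter():
    print(query.query_id, query.text)
    break

for qrel in dataset.qrels_iter():
    print(qrel.query_id, qrel.doc_id, qrel.relevance)
    break
```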
Checklist
Mark each task once completed. All should be checked prior to merging a new dataset.
- [ ] Dataset definition (in `ir_datasets/datasets/[topid].py`)
- [ ] Tests (in `tests/integration/[topid].py`)
- [ ] Metadata generated (using the `ir_datasets generate_metadata` command, should appear in `ir_datasets/etc/metadata.json`)
- [ ] Documentation (in `ir_datasets/etc/[topid].yaml`)
- [ ] Documentation generated in https://github.com/seanmacavaney/ir-datasets.com/
- [ ] Downloadable content (in `ir_datasets/etc/downloads.json`)
- [ ] Download verification action (in `.github/workflows/verify_downloads.yml`). Only one needed per `topid`.
- [ ] Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in `downloads.json`.
Additional comments/concerns/ideas/etc.
I have started to work on this and have a first prototype locally that uses TrecDocs and TsvQueries, so not much code should be needed here.
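Roughly, such a definition could look like this (a sketch with illustrative file names and paths, not the actual prototype; the exact constructor arguments should be checked against existing modules in `ir_datasets/datasets/`):

```python
import ir_datasets
from ir_datasets.datasets.base import Dataset
from ir_datasets.formats import TrecDocs, TsvQueries, TrecQrels
from ir_datasets.util import LocalDownload, home_path

NAME = 'longeval'
base_path = home_path() / NAME

# The files are placed manually by the user (see the license discussion
# below); LocalDownload shows the given message if a file is missing.
# File names below are illustrative, not the final layout.
docs_dlc = LocalDownload(base_path / 'en' / 'documents', 'Please obtain the LongEval collection and place it here.')
queries_dlc = LocalDownload(base_path / 'en' / 'train' / 'queries.tsv', 'Please place the train queries here.')
qrels_dlc = LocalDownload(base_path / 'en' / 'train' / 'qrels.txt', 'Please place the train qrels here.')

QREL_DEFS = {0: 'not relevant', 1: 'relevant'}  # illustrative labels

ir_datasets.registry.register(f'{NAME}/en/train', Dataset(
    TrecDocs(docs_dlc),              # TREC-formatted documents
    TsvQueries(queries_dlc),         # query_id<TAB>text
    TrecQrels(qrels_dlc, QREL_DEFS),
))
```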
Awesome! Given LongEval's focus on temporal change, I think it should be encoded at a higher level in the dataset IDs, e.g.:
`longeval(placeholder)/[2023-07|2023-09|...](placeholder)/[en|fr|...](docs)/[train|heldout|eval|...](docs, queries, qrels)`
Though maybe I'm missing something about how the task is structured?
Yes, that makes perfect sense. Can I implement this ticket? (I already have a prototype; it is not much code, as LongEval comes in formats already supported by ir_datasets.)
That would be awesome! I love when folks release data in standard formats :-)
If I may add something: the LongEval collection is subject to a custom license from Qwant (https://lindat.mff.cuni.cz/repository/xmlui/page/Qwant_LongEval_BY-NC-SA_License, basically an extension of the CC-BY-NC license) that requires an explicit agreement as well as providing contact information. Is this feasible within ir-datasets?
Dear Romain,
Thanks for reaching out. Yes, this is feasible.
The ir-datasets integration would expect the user to download the data manually (I already have a prototype implementation that assumes this). I.e., ir-datasets would not download the dataset itself, but would only show a message telling the user how to obtain the data (thereby filling out the explicit agreement and providing contact information) and then place it in a predefined directory.
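Roughly, the intended user-facing behavior would look like this (a sketch; the exact exception type, message, and directory layout are assumptions rather than the final implementation):

```python
import ir_datasets

dataset = ir_datasets.load("longeval/fr/train")
try:
    # Accessing the data triggers the check for the manually placed files.
    next(dataset.docs_iter())
except IOError as err:
    # If the files are missing, the error message explains how to obtain the
    # collection (accepting the Qwant license and providing contact details)
    # and where to place it, e.g. under the ir_datasets home directory.
    print(err)
```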
Best regards,
Maik