ir_datasets

LongEval Retrieval (used at CLEF 2023)

Open mam10eks opened this issue 2 years ago • 6 comments

Dataset Information:

The goal is to integrate the LongEval data for Task 1 (retrieval).

The information from the official task description:

The goal of Task 1 is to propose an information retrieval system which can handle changes over the time. The proposed retrieval system should follow the temporal timewise evolution of Web documents. The Longeval Websearch collection relies on a large set of data (corpus of pages, queries, user interaction) provided by a commercial search engine (Qwant). It is designed to reflect the changes of the Web across time, by providing evolving document and query sets. The queries in the collection were collected from Qwant's users over several months and can thus be expected to reflect the changes in the search preferences of the users. The documents in the collection were then selected to be able to well evaluate retrieval on these queries at the time they were collected, and thus also change over a time.

Links to Resources:

https://clef-longeval.github.io/

Dataset ID(s) & supported entities:

  • longeval/en/train: docs, queries, qrels
  • longeval/en/heldout: docs, queries
  • longeval/en/a-short-july: docs, queries
  • longeval/en/b-long-september: docs, queries
  • longeval/fr/train: docs, queries, qrels
  • longeval/fr/heldout: docs, queries
  • longeval/fr/a-short-july: docs, queries
  • longeval/fr/b-long-september: docs, queries

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

  • [ ] Dataset definition (in ir_datasets/datasets/[topid].py)
  • [ ] Tests (in tests/integration/[topid].py)
  • [ ] Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
  • [ ] Documentation (in ir_datasets/etc/[topid].yaml)
    • [ ] Documentation generated in https://github.com/seanmacavaney/ir-datasets.com/
  • [ ] Downloadable content (in ir_datasets/etc/downloads.json)
    • [ ] Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
    • [ ] Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.

Additional comments/concerns/ideas/etc.

mam10eks avatar May 13 '23 08:05 mam10eks

I have started working on this and have a first local prototype that uses TrecDocs and TsvQueries, so not much code should be needed here.
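To illustrate the two formats mentioned above, here is a minimal standard-library sketch. It is not the prototype and it does not use the TrecDocs or TsvQueries classes from ir_datasets; the helper names `parse_trec_docs` and `parse_tsv_queries`, the `<DOCNO>`/`<TEXT>` field layout, and the sample data are assumptions made for illustration only.

```python
import csv
import io
import re
from typing import Iterator, NamedTuple


class Doc(NamedTuple):
    doc_id: str
    text: str


class Query(NamedTuple):
    query_id: str
    text: str


# Matches one <DOC>...</DOC> record that holds a <DOCNO> and a <TEXT> field.
# Assumption: the real files may carry additional fields.
_DOC_RE = re.compile(
    r"<DOC>\s*<DOCNO>(.*?)</DOCNO>\s*<TEXT>(.*?)</TEXT>\s*</DOC>",
    re.DOTALL,
)


def parse_trec_docs(raw: str) -> Iterator[Doc]:
    """Yield one Doc per TREC-style record found in the raw text."""
    for match in _DOC_RE.finditer(raw):
        yield Doc(match.group(1).strip(), match.group(2).strip())


def parse_tsv_queries(raw: str) -> Iterator[Query]:
    """Yield one Query per tab-separated line of the form: query_id, text."""
    reader = csv.reader(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) >= 2:
            yield Query(row[0], row[1])


# Made-up sample data, not taken from the LongEval collection.
SAMPLE_DOCS = """<DOC>
<DOCNO>doc1</DOCNO>
<TEXT>
first example page
</TEXT>
</DOC>
<DOC>
<DOCNO>doc2</DOCNO>
<TEXT>
second example page
</TEXT>
</DOC>
"""
SAMPLE_QUERIES = "q1\texample query\nq2\tanother query\n"

docs = list(parse_trec_docs(SAMPLE_DOCS))
queries = list(parse_tsv_queries(SAMPLE_QUERIES))
```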

mam10eks avatar May 13 '23 08:05 mam10eks

Awesome! Given LongEval's temporal focus, I think the time period should be encoded at a higher level in the dataset IDs, e.g.:

  • longeval (placeholder)
    • /[2023-07|2023-09|...] (placeholder)
      • /[en|fr|...] (docs)
        • /[train|heldout|eval|...] (docs, queries, qrels)

Though maybe I'm missing something about how the task is structured?
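The hierarchy proposed above can be sketched as a small enumeration. This is only an illustration of the ID scheme: the function name `build_dataset_ids` is hypothetical, the snapshot, language, and split values are the examples from the list, and the entity lists are upper bounds (per the issue description, some splits ship without qrels).

```python
from typing import Dict, List, Sequence


def build_dataset_ids(
    snapshots: Sequence[str],
    languages: Sequence[str],
    splits: Sequence[str],
) -> Dict[str, List[str]]:
    """Map each dataset ID in the proposed hierarchy to the entities it may expose."""
    ids: Dict[str, List[str]] = {"longeval": []}  # placeholder, no entities
    for snapshot in snapshots:
        ids[f"longeval/{snapshot}"] = []  # placeholder, no entities
        for lang in languages:
            # The language level exposes the document collection.
            ids[f"longeval/{snapshot}/{lang}"] = ["docs"]
            for split in splits:
                # Upper bound: not every split necessarily has qrels.
                ids[f"longeval/{snapshot}/{lang}/{split}"] = [
                    "docs",
                    "queries",
                    "qrels",
                ]
    return ids


dataset_ids = build_dataset_ids(
    snapshots=["2023-07", "2023-09"],
    languages=["en", "fr"],
    splits=["train", "heldout", "eval"],
)
```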

seanmacavaney avatar May 15 '23 18:05 seanmacavaney

Yes, that makes perfect sense. Shall I implement this ticket? (I already have a prototype; it is not much code, as LongEval comes in formats already supported in ir_datasets.)

mam10eks avatar May 16 '23 11:05 mam10eks

That would be awesome! I love when folks release data in standard formats :-)

seanmacavaney avatar May 16 '23 16:05 seanmacavaney

If I may add something: the LongEval collection is subject to a custom license from Qwant (https://lindat.mff.cuni.cz/repository/xmlui/page/Qwant_LongEval_BY-NC-SA_License, basically an extension of the CC-BY-NC License) that requires an explicit agreement as well as contact information. Is this feasible within ir-datasets?

romaindeveaud avatar Jun 20 '23 12:06 romaindeveaud

Dear Romain,

Thanks for reaching out. Yes, this is feasible.

The ir-datasets integration would expect the user to download the data manually (I already have a prototype implementation that assumes this). That is, ir-datasets would not download the dataset itself; it would only show a message asking the user to obtain the data (thereby completing the explicit agreement and providing contact information) and then store it in a predefined directory.
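As a rough sketch of that flow (not the prototype, and not ir-datasets' actual mechanism), the check could look like the following. The names `ManualDownloadRequired` and `require_local_data`, and the directory used in the demo, are hypothetical.

```python
import tempfile
from pathlib import Path

# Task website from the issue description; users would obtain the data from there.
INSTRUCTIONS_URL = "https://clef-longeval.github.io/"


class ManualDownloadRequired(Exception):
    """Raised when the manually obtained data is not in the expected directory."""


def require_local_data(data_dir: Path, instructions_url: str) -> Path:
    """Return data_dir if it contains files; otherwise explain how to get the data."""
    if not data_dir.is_dir() or not any(data_dir.iterdir()):
        raise ManualDownloadRequired(
            f"No data found in {data_dir}. Please obtain the collection via "
            f"{instructions_url} (this involves accepting the license agreement "
            f"and providing contact information) and place the files in {data_dir}."
        )
    return data_dir


# Demo in a temporary directory: first the data is missing, then it is present.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "longeval"  # hypothetical predefined directory
    try:
        require_local_data(root, INSTRUCTIONS_URL)
        message = None
    except ManualDownloadRequired as err:
        message = str(err)

    root.mkdir()
    (root / "queries.tsv").write_text("q1\texample query\n")
    found = require_local_data(root, INSTRUCTIONS_URL) == root
```

Nothing is downloaded here: the code only verifies that the files are present and otherwise points the user to the place where the agreement can be completed.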

Best regards,

Maik

mam10eks avatar Jul 19 '23 23:07 mam10eks