ir_datasets
Datamaestro integration
Is your feature request related to a problem? Please describe.
Integration within datamaestro
Describe the solution you'd like
I am working on integrating ir_datasets into datamaestro so that querying available datasets is more standardized (JSON generation, etc.), which in turn provides a way to automate indexing and retrieval (see e.g. retrieval with experimaestro-ir).
It would also make it possible to handle dataset management within ir_datasets (cleanup, documentation, and maybe more as datamaestro matures).
Describe alternatives you've considered
None
Additional context
At the moment, I am coding within experimaestro-ir, but I would be glad to move the code to ir_datasets and modify datamaestro so that it is more generic (abstracting away dataset access). If the code moves to ir_datasets, it will be isolated so that it is only triggered when datamaestro is installed and used.
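To illustrate the kind of isolation meant here, a minimal sketch of an optional-dependency guard follows. The function names `is_available` and `register_datamaestro_bridge` are hypothetical and not part of ir_datasets or datamaestro; the only assumption about datamaestro is that its top-level module is named `datamaestro`.

```python
import importlib.util


def is_available(module_name: str) -> bool:
    """Return True if a top-level module is installed, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def register_datamaestro_bridge() -> bool:
    """Hypothetical hook: set up the bridge only when datamaestro is installed.

    Returns True if the bridge was set up, False if datamaestro is missing.
    """
    if not is_available("datamaestro"):
        # datamaestro is absent: do nothing, so ir_datasets behaves as usual.
        return False
    import datamaestro  # noqa: F401  (imported lazily, only on this path)
    # The actual registration logic would go here (not specified in this issue).
    return True
```

With this pattern, users who never install datamaestro pay no import cost and see no change in behaviour.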
Awesome! Let me know if there are changes in ir_datasets that could help facilitate this.
You can access the documentation for a given dataset via dataset.documentation(), which returns a dict. Every dataset has a "desc" entry, given as HTML. There is other structured information too (e.g., "bibtex_ids", which points to records in ir_datasets.bib, and "official_measures", which points to measure names from ir_measures), but these fields are not always present.
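Since only "desc" is guaranteed, code consuming this dict should treat the other keys as optional. A minimal sketch, where `summarize_documentation` is a hypothetical helper and `sample` is made-up data standing in for the result of dataset.documentation():

```python
from typing import Any, Dict


def summarize_documentation(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Normalize a documentation dict so that optional fields always exist."""
    return {
        # "desc" is expected for every dataset; fall back to "" defensively.
        "desc": doc.get("desc") or "",
        # The two fields below are not always present, so default to empty lists.
        "bibtex_ids": list(doc.get("bibtex_ids") or []),
        "official_measures": list(doc.get("official_measures") or []),
    }


# Made-up example input with only the description field.
sample = {"desc": "<p>Example collection.</p>"}
summary = summarize_documentation(sample)
```

With ir_datasets installed, the same helper could be applied to a real dataset, e.g. `summarize_documentation(ir_datasets.load("msmarco-passage").documentation())`.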
I will submit pull requests when needed.
I am already integrating ir_measures into experimaestro-ir; I still have to think about how to build a full bridge.