
Add BioASQ dataset to the list of supported BEIR datasets

Open MathVast opened this issue 1 year ago • 2 comments

Hi @seanmacavaney, I would like to use the BioASQ dataset for an experiment, and I came across this on the GitHub repo of the BEIR paper (beir-cellar), where the author links the preprocessed data for the 4 datasets marked as "unavailable". I am aware that you've been trying to extend the list of BEIR datasets available in ir_datasets (i.e., this issue), and I was wondering whether, given these resources, BioASQ could be integrated into the catalog?

Dataset Information:

BioASQ is a dataset featured in the BEIR benchmark. It originated from a challenge on "biomedical semantic indexing and question answering". More information about the challenge and the dataset can be found here: http://bioasq.org/

Links to Resources:

  • Steps listed on beir-cellar to reproduce the files: https://github.com/beir-cellar/beir/tree/main/examples/dataset#2-bioasq
  • Google Drive space linked in the issue cited above, where the preprocessed data can be found: https://drive.google.com/drive/folders/1CgDO-KmQQMpGEGeD3R20ZgTTM008xix9

Dataset ID(s) & supported entities:

  • beir/bioasq-2020: queries, docs
  • beir/bioasq-2020/train: queries, docs, qrels
  • beir/bioasq-2020/test: queries, docs, qrels
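
For reference, here is a minimal sketch of how the three entity types above could be read from the preprocessed files, assuming they follow the usual BEIR layout (`corpus.jsonl` and `queries.jsonl` with one JSON object per line, plus tab-separated qrels files with a `query-id`, `corpus-id`, `score` header). The `Doc`, `Query`, and `Qrel` types and the `iter_*` helpers are hypothetical names for illustration only, not part of ir_datasets, and the sample records are made up.

```python
import csv
import json
from typing import Iterable, Iterator, NamedTuple


# Hypothetical record types for illustration; not ir_datasets classes.
class Doc(NamedTuple):
    doc_id: str
    title: str
    text: str


class Query(NamedTuple):
    query_id: str
    text: str


class Qrel(NamedTuple):
    query_id: str
    doc_id: str
    relevance: int


def iter_docs(lines: Iterable[str]) -> Iterator[Doc]:
    """Parse corpus.jsonl-style lines (assumed keys: _id, title, text)."""
    for line in lines:
        if line.strip():
            rec = json.loads(line)
            yield Doc(rec["_id"], rec.get("title", ""), rec["text"])


def iter_queries(lines: Iterable[str]) -> Iterator[Query]:
    """Parse queries.jsonl-style lines (assumed keys: _id, text)."""
    for line in lines:
        if line.strip():
            rec = json.loads(line)
            yield Query(rec["_id"], rec["text"])


def iter_qrels(lines: Iterable[str]) -> Iterator[Qrel]:
    """Parse a tab-separated qrels file, skipping its header row."""
    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)  # header: query-id, corpus-id, score
    for row in reader:
        if row:
            yield Qrel(row[0], row[1], int(row[2]))


# Made-up sample records standing in for the real files.
sample_corpus = ['{"_id": "d1", "title": "Example title", "text": "Example abstract."}']
sample_queries = ['{"_id": "q1", "text": "Example question?"}']
sample_qrels = ["query-id\tcorpus-id\tscore", "q1\td1\t1"]

docs = list(iter_docs(sample_corpus))
queries = list(iter_queries(sample_queries))
qrels = list(iter_qrels(sample_qrels))
```

With real files, each helper can be given an open file handle instead of a list, e.g. `iter_docs(open("corpus.jsonl", encoding="utf-8"))`.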

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

  • [ ] Dataset definition (in ir_datasets/datasets/[topid].py)
  • [ ] Tests (in tests/integration/[topid].py)
  • [ ] Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
  • [ ] Documentation (in ir_datasets/etc/[topid].yaml)
    • [ ] Documentation generated in https://github.com/seanmacavaney/ir-datasets.com/
  • [ ] Downloadable content (in ir_datasets/etc/downloads.json)
    • [ ] Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
    • [ ] Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.

MathVast • Sep 21 '23 13:09

Hey @MathVast! Sorry for the delay -- the start of semester is a busy time.

Thanks for opening the issue. This seems doable and like a good addition to the package.

seanmacavaney • Oct 07 '23 11:10

No problem. In the meantime, I've made a fork and worked on integrating BioASQ into ir_datasets on my side. I've been playing with the dataset through XPM-IR and it seems to be working, but you might want to check some of the choices I've made. If that's okay with you, @seanmacavaney, I can open a PR.

MathVast • Oct 07 '23 14:10