ir_datasets
Add BioASQ dataset to the list of supported BEIR datasets
Hi @seanmacavaney, I would like to use the BioASQ dataset for an experiment, and I came across this on beir-cellar, the GitHub repo of the BEIR paper, where the author links the preprocessed data for the 4 datasets marked as "unavailable". I am aware that you've been trying to extend the list of benchmark datasets available in ir_datasets (i.e. this issue), and I was wondering whether, given these resources, BioASQ could be added to the catalog?
Dataset Information:
BioASQ is a dataset featured in the BEIR benchmark. It originated from a challenge on "biomedical semantic indexing and question answering". More information about the challenge and the dataset can be found here: http://bioasq.org/
Links to Resources:
- Steps listed on beir-cellar to reproduce the files: https://github.com/beir-cellar/beir/tree/main/examples/dataset#2-bioasq
- Google Drive folder linked in the issue cited above, where the preprocessed data can be found: https://drive.google.com/drive/folders/1CgDO-KmQQMpGEGeD3R20ZgTTM008xix9
Dataset ID(s) & supported entities:
- `beir/bioasq-2020`: queries, docs
- `beir/bioasq-2020/train`: queries, docs, qrels
- `beir/bioasq-2020/test`: queries, docs, qrels
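To make the three entity types concrete, here is a minimal sketch of readers for queries, docs and qrels. It assumes the preprocessed files follow the usual BEIR layout (`corpus.jsonl` and `queries.jsonl` with `_id`/`title`/`text` fields, plus a tab-separated qrels file with a `query-id`, `corpus-id`, `score` header); I have not checked this against the BioASQ files themselves. The `Doc`, `Query` and `Qrel` types and the `iter_*` helpers are hypothetical names for illustration, not the ir_datasets API.

```python
import csv
import json
from pathlib import Path
from typing import Iterator, NamedTuple


# Hypothetical record types, loosely mirroring the entities listed above.
class Doc(NamedTuple):
    doc_id: str
    title: str
    text: str


class Query(NamedTuple):
    query_id: str
    text: str


class Qrel(NamedTuple):
    query_id: str
    doc_id: str
    relevance: int


def iter_docs(path: Path) -> Iterator[Doc]:
    """Yield one Doc per non-empty line of a BEIR-style corpus.jsonl (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                rec = json.loads(line)
                # "title" may be missing or empty in some corpora.
                yield Doc(rec["_id"], rec.get("title", ""), rec["text"])


def iter_queries(path: Path) -> Iterator[Query]:
    """Yield one Query per non-empty line of a BEIR-style queries.jsonl (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                rec = json.loads(line)
                yield Query(rec["_id"], rec["text"])


def iter_qrels(path: Path) -> Iterator[Qrel]:
    """Yield Qrels from a TSV file with a query-id / corpus-id / score header (assumed layout)."""
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            yield Qrel(row["query-id"], row["corpus-id"], int(row["score"]))
```

Under this layout, the train and test subsets would share the same docs and differ only in which queries and qrels they expose.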
Checklist
Mark each task once completed. All should be checked prior to merging a new dataset.
- [ ] Dataset definition (in `ir_datasets/datasets/[topid].py`)
- [ ] Tests (in `tests/integration/[topid].py`)
- [ ] Metadata generated (using `ir_datasets generate_metadata` command, should appear in `ir_datasets/etc/metadata.json`)
- [ ] Documentation (in `ir_datasets/etc/[topid].yaml`)
  - [ ] Documentation generated in https://github.com/seanmacavaney/ir-datasets.com/
- [ ] Downloadable content (in `ir_datasets/etc/downloads.json`)
  - [ ] Download verification action (in `.github/workflows/verify_downloads.yml`). Only one needed per `topid`.
  - [ ] Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in `downloads.json`.
Hey @MathVast! Sorry for the delay -- the start of semester is a busy time.
Thanks for opening the issue. This seems doable and like a good addition to the package.
No problem. In the meantime, I've made a fork and worked on integrating BioASQ into ir_datasets on my side. I've been playing with the dataset through XPM-IR and it seems to be working, but you might want to check some of the choices I've made. If that's okay with you @seanmacavaney, I can open a PR.