
BioRED: a rich biomedical relation extraction dataset


BioRED is a first-of-its-kind biomedical relation extraction (RE) corpus with multiple entity types (e.g., gene/protein, disease, chemical) and relation pairs (e.g., gene-disease; chemical-chemical) annotated at the document level on a set of 600 PubMed abstracts. Further, we label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information. We assess the utility of BioRED by benchmarking several existing state-of-the-art methods, including BERT-based models, on the named entity recognition (NER) and RE tasks. Our experiments also demonstrate that such a rich dataset can facilitate the development of more accurate, efficient, and robust RE systems for biomedicine. The dataset was used in the NIH LitCoin NLP Challenge (https://ncats.nih.gov/funding/challenges/litcoin), in which over 200 teams participated. This repository provides the dataset, annotation guideline, source code, and models from our paper.

Content

  • BIORED.zip: This file contains the 600 PubMed abstracts with our annotations, divided into training, development, and test sets.
  • BioRED_Annotation_Guideline.pdf: This file describes the guideline used for annotating BioRED.
  • biored_re_source_code.tar: This file contains our PubMedBERT source code for BioRED relation classification and novelty detection. The BERT-GT version can be found at https://github.com/ncbi/bert_gt.
  • biored_re_model.tar: This file contains the models generated by the code in biored_re_source_code.tar. You can also use these models for prediction.
    • The biored_all_mul_model folder contains the relation classification model.
    • The biored_novelty_model folder contains the novelty detection model.
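
To work with the files listed above, the archives first need to be unpacked. The following is a minimal sketch, assuming the three archives have been downloaded into the current working directory; the helper name `unpack` and the `biored/` output directory are illustrative choices, not part of the released code, and nothing is assumed about the layout inside each archive.

```python
import tarfile
import zipfile
from pathlib import Path


def unpack(archive, dest):
    """Extract a .zip or .tar archive into dest and return its member names.

    `unpack` is a hypothetical helper for illustration only.
    """
    archive, dest = Path(archive), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    if zipfile.is_zipfile(archive):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
            return zf.namelist()
    if tarfile.is_tarfile(archive):
        with tarfile.open(archive) as tf:
            tf.extractall(dest)
            return tf.getnames()
    raise ValueError(f"unsupported archive: {archive}")


if __name__ == "__main__":
    # File names are taken from the list above; missing files are skipped.
    for name in ("BIORED.zip", "biored_re_source_code.tar", "biored_re_model.tar"):
        if Path(name).exists():
            members = unpack(name, Path("biored") / Path(name).stem)
            print(f"{name}: {len(members)} entries")
```

After unpacking, the extracted directories can be inspected to locate the data splits and the biored_all_mul_model and biored_novelty_model folders.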

Citing BioRED

@article{luo2022biored,
  author    = {Luo, Ling and Lai, Po-Ting and Wei, Chih-Hsuan and Arighi, Cecilia N and Lu, Zhiyong},
  title     = {BioRED: A Rich Biomedical Relation Extraction Dataset},
  journal   = {Briefings in Bioinformatics},
  year      = {2022},
  publisher = {Oxford University Press}
}

Acknowledgments

The authors are grateful to Drs. Tyler F. Beck and Christine Colvis, Scientific Program Officer at NCATS, and their entire research team for help with our dataset. The authors would also like to thank Rancho BioSciences, and specifically Mica Smith, Thomas Allen Ford-Hutchinson, and Brad Farrell, for their contributions to data curation.

Disclaimer

This tool shows the results of research conducted in the Computational Biology Branch, NCBI. The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced on this website. NIH does not independently verify the validity or utility of the information produced by this tool. If you have questions about the information produced on this website, please see a health care professional. More information about NCBI's disclaimer policy is available.