Biomedical Corpora
A collection of annotated, freely distributable, biomedical corpora, which can be used for training supervised machine learning methods for various tasks in biomedical text-mining and information extraction.
All corpora are provided in the corpora directory. They are divided into two subdirectories: NER, for corpora which can be used to train named entity recognition (NER) solutions, and Relation Extraction, for corpora which can be used to train relation/event extraction solutions. Each corpus is provided in both a CoNLL-like format and a Standoff format.
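As a rough illustration of working with the CoNLL-like files, the sketch below groups token/tag lines into sentences. It assumes one token per line with the tag in the last whitespace-separated column, blank lines between sentences, and B-/I-/O style tags; none of this is confirmed by the text above, so check a file from the repository before relying on it. The function name `read_conll` and the sample data are hypothetical.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group token/tag lines into sentences; blank lines separate sentences."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in lines:
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumed layout: token in the first column, tag in the last column.
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Made-up sample in the assumed layout (not taken from any corpus).
sample = "BRCA1\tB-PRGE\nmutations\tO\n\nAspirin\tB-CHED\n"
sentences = read_conll(sample.splitlines())
```

To read a real file, pass an open file handle instead of `sample.splitlines()`.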
Most corpora in the CoNLL-like format were originally collected here. In many cases, the tags were mapped to 4-letter codes:
| Old tag | New tag |
|---|---|
| Chemical, Simple_chemical | CHED |
| Disease | DISO |
| Organism, Species, NCBITaxon, Taxon | LIVB |
| Cellular_component | COMP |
| Cell, cell_type | CLTP |
| cell_line | CLLN |
| Gene, Protein, Gene_or_gene_product, GGP | PRGE |
Mappings were largely inspired by this API.
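The mapping can be written as a lookup table. The sketch below is not the code used to build this repository; it only restates the old-to-new tag pairs listed in this section. The helper `map_tag` is hypothetical, and it assumes tags may carry a prefix such as `B-` or `I-`.

```python
# Old tag -> new 4-letter code, as listed in this section.
OLD_TO_NEW = {
    "Chemical": "CHED",
    "Simple_chemical": "CHED",
    "Disease": "DISO",
    "Organism": "LIVB",
    "Species": "LIVB",
    "NCBITaxon": "LIVB",
    "Taxon": "LIVB",
    "Cellular_component": "COMP",
    "Cell": "CLTP",
    "cell_type": "CLTP",
    "cell_line": "CLLN",
    "Gene": "PRGE",
    "Protein": "PRGE",
    "Gene_or_gene_product": "PRGE",
    "GGP": "PRGE",
}


def map_tag(tag: str) -> str:
    """Map e.g. 'B-Protein' to 'B-PRGE'; leave 'O' and unknown labels unchanged."""
    prefix, sep, label = tag.partition("-")
    if not sep:
        # No prefix present: treat the whole tag as the label.
        return OLD_TO_NEW.get(tag, tag)
    return f"{prefix}{sep}{OLD_TO_NEW.get(label, label)}"
```

Labels without an entry in the table pass through unchanged, which matches the note that tags were mapped "in many cases" rather than always.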
Corpora names (loosely) follow the naming scheme: <corpus_name>_<entity>_<tagset>.
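Because the scheme is only loosely followed, any code that parses these names should tolerate exceptions. A minimal sketch, assuming the last two underscore-separated fields are the entity and the tagset; the function name and the example names are hypothetical and not taken from the repository:

```python
from typing import NamedTuple, Optional


class CorpusName(NamedTuple):
    corpus_name: str
    entity: str
    tagset: str


def parse_corpus_name(name: str) -> Optional[CorpusName]:
    """Split '<corpus_name>_<entity>_<tagset>'; return None if it does not fit."""
    # Split from the right so that underscores inside the corpus name survive.
    parts = name.rsplit("_", 2)
    if len(parts) != 3 or not all(parts):
        return None
    return CorpusName(*parts)


# Hypothetical names, used only to illustrate the scheme.
parsed = parse_corpus_name("NCBI_disease_DISO_IOB")
unparsed = parse_corpus_name("AnatEM")
```

Here `parsed.corpus_name` is `"NCBI_disease"`, and `unparsed` is `None` because the name has no entity or tagset fields.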
Download
To download the corpora, simply clone the repository locally:
$ git clone https://github.com/BaderLab/Biomedical-Corpora.git
Or click the green Clone or download button and select Download ZIP.
Resources
https://github.com/spyysalo provides many useful repositories for working with these corpora. Many of the most popular corpora have their own repositories (e.g. S800, NCBI-Disease) which contain code for collecting the corpus from its original source and converting it into a format suitable for training a machine learning classifier (e.g. CoNLL or Standoff).
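To give a sense of what such a conversion involves, the sketch below turns character-offset entity spans (the information a standoff annotation carries) into token-level B-/I-/O tags. It is not the code from any of the repositories mentioned above. It assumes whitespace tokenization and non-overlapping spans, both simplifications, and the function name and example sentence are invented for illustration.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)


def spans_to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Tag whitespace-delimited tokens with B-/I-/O labels from character spans."""
    tagged = []
    previous = None  # span that covered the previous token, if any
    for match in re.finditer(r"\S+", text):
        # First span that overlaps this token's character range, if any.
        covering = next(
            (s for s in spans if s[0] < match.end() and match.start() < s[1]),
            None,
        )
        if covering is None:
            tag = "O"
        elif covering is previous:
            tag = "I-" + covering[2]  # continues the same entity
        else:
            tag = "B-" + covering[2]  # first token of a new entity
        tagged.append((match.group(), tag))
        previous = covering
    return tagged


# Invented example; offsets are character positions in `text`.
text = "Mutations in BRCA1 cause breast cancer"
spans = [(13, 18, "PRGE"), (25, 38, "DISO")]
tagged = spans_to_bio(text, spans)
```

A real converter would also need proper tokenization (so that punctuation is split from entity tokens) and sentence splitting.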
Table of Corpora
A list of various biomedical corpora and their corresponding publications:
| Corpora | Text Genre | Standard | Entities (Count) | Publication |
|---|---|---|---|---|
| AnatEM | Scientific Article | Gold | 12 Anatomical entities | link |
| AZDC | Scientific Article | Gold | Disease | link |
| BioCreative II GM | Scientific Article | Gold | Genes/proteins (24,583) | link |
| BioInfer | Scientific Article | Gold | Genes/proteins | link |
| BioSemantics | Patent | Gold | Chemicals, Disease | link |
| BC4CHEMD | Scientific Article | Gold | Chemicals (84,310) | link |
| BC5CDR | Scientific Article | Gold | Chemicals (15,935), Disease (12,852) | link |
| BioNLP09 | Scientific Article | Gold | Genes/proteins (14,963) | link |
| BioNLP11EPI | Scientific Article | Gold | Genes/proteins (15,811) | link |
| BioNLP11ID | Scientific Article | Gold | Genes/proteins (6,551), Organisms (3,471), Chemicals (973), Regulon-operon (87) | link |
| BioNLP13GE | Scientific Article | Gold | Genes/proteins (12,057) | link |
| BioNLP13PC | Scientific Article | Gold | Genes/proteins (10,891), Chemicals (2,487), Complexes (1,502), Cellular component (1,013) | link |
| CRAFT | Scientific Article | Gold | Sequence Ontology (18,974), Genes/proteins (16,064), Taxonomy (6,868), Chemicals of biological interest (6,053), Cell lines (5,495), GO-CC (4,180) | link |
| CellFinder | Scientific Article | Gold | Species, Genes/proteins, Cell type, Anatomy | link |
| CHEMDNER Patent | Patent | Gold | Chemicals | link |
| DECA | Scientific Article | Gold | Genes/proteins | link |
| Ex-PTM | Scientific Article | Gold | Genes/proteins (4,698) | link |
| FSU-PRGE | Scientific Article | Gold | Genes/proteins | link |
| JNLPBA | Scientific Article | Gold | Genes/proteins (35,336), DNA (10,589), Cell type (8,639), Cell line (4,330), RNA (1,069) | link |
| LINNAEUS | Scientific Article | Gold | Organisms (4,263) | link |
| LocText | Scientific Article | Gold | Organisms, Genes/proteins | link |
| IEPA | Scientific Article | Gold | Genes/proteins | link |
| miRNA | Scientific Article | Gold | Disease, Organisms, Genes/proteins | link |
| NCBI disease | Scientific Article | Gold | Disease (6,881) | link |
| S800 | Scientific Article | Gold | Organisms (3,708) | link |
| Variome | Scientific Article | Gold | Disease, Organisms, Genes/proteins | link |
Note: some corpora listed in this table are not available for download from this repository because they are not freely distributable.