biomedical_corpora

This table compiles biomedical corpora available for named entity recognition (some are also suitable for association detection). It was published as part of the paper:

Dieter Galea, Ivan Laponogov, Kirill Veselkov; Exploiting and assessing multi-source data for supervised biomedical named entity recognition, Bioinformatics, bty152, https://doi.org/10.1093/bioinformatics/bty152

If you know of other relevant corpora, please submit a pull request and I'll happily approve it.

| Corpus | Year | Format | Documents | Original Publication | Downloaded From | Other URLs |
| --- | --- | --- | --- | --- | --- | --- |
| Ab3P (Abbreviation Plus P-Precision) | 2008 | BioC | 1250 PubMed Abstracts | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2576267/ | http://bioc.sourceforge.net/ | |
| AIMed | 2005 | BioC | ~ 1000 MEDLINE abstracts (200 abstracts) | http://www.sciencedirect.com/science/article/pii/S0933365704001319 | http://corpora.informatik.hu-berlin.de/ | http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.3218&rep=rep1&type=pdf |
| AnatEM (Anatomical entity mention recognition) | 2013 | CONLL, standoff | 1212 docs (500 docs from AnEM + 262 from MLEE + 450 others) | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3957068/ | http://nactem.ac.uk/anatomytagger/#AnatEM | |
| AnEM | 2012 | BioC | 500 docs (PubMed and PMC); abstracts and full text drawn randomly | http://www.nactem.ac.uk/anatomy/docs/ohta2012opendomain.pdf | http://corpora.informatik.hu-berlin.de/ | |
| AZDC (Arizona Disease Corpus) | 2009 | IeXML, .txt | 2856 PubMed abstracts (2775 sentences). Other source says 794 PubMed Abstracts | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2352871/ | http://www.ebi.ac.uk/Rebholz-srv/CALBC/corpora/IeXML/goldcorpus/azdc-1.xml | http://diego.asu.edu/downloads/AZDC_6-26-2009.txt |
| BEL (BioCreative V5 BEL Track) | 2016 | BioC | | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4995071/ | https://wiki.openbel.org/display/BIOC/Datasets | |
| BioADI | 2009 | BioC | 1201 PubMed abstracts | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2788358/ | http://bioc.sourceforge.net/ | |
| BioCause | 2013 | standoff | 19 full-text documents | http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-2 | http://www.nactem.ac.uk/biocause/download.php | |
| BioCreative-PPI | | XML | | | https://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html | |
| BioGRID | 2017 | BioC | 120 full text articles | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5225395/ | http://bioc.sourceforge.net/BioC-BioGRID.html | |
| BioInfer | 2007 | BioC | 1100 sentences from biomedical literature | http://www.biomedcentral.com/1471-2105/8/50 | http://corpora.informatik.hu-berlin.de/ | http://mars.cs.utu.fi/BioInfer |
| BioMedLat | 2016 | standoff | 643 BioASQ questions/factoids | https://www.semanticscholar.org/paper/BioMedLAT-Corpus-Annotation-of-the-Lexical-Answer-Neves-Kraus/b0f09f94015771c31bd2483efdd8f0f86996384e | https://github.com/mariananeves/BioMedLAT | |
| BioText | 2004 | txt | 100 titles and 40 abstracts | http://biotext.berkeley.edu/papers/acl04-relations.pdf | https://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html | |
| CDR (BioCreative V) | | BioC | | | http://bioc.sourceforge.net/ | |
| CellFinder 1.0 | 2012 | BioC | 10 full documents from PMC from (Loser et al. 2009) on "Human Embryonic Stem Cell Lines and Their Use in International Research" | http://www.nactem.ac.uk/biotxtm2012/presentations/Neves-pres.pdf | http://corpora.informatik.hu-berlin.de/ | http://cellfinder.de/about/annotation/ |
| CG Cancer-Genetics (BioNLP-ST 2013) | 2013 | BioC, standoff | | http://aclweb.org/anthology/W/W13/W13-2008.pdf | http://2013.bionlp-st.org/tasks/cancer-genetics | |
| CHEMDNER (BioCreative IV Track 2) | 2013 | BioC / standoff | | http://www.biocreative.org/media/store/files/2013/bc4_v2_1.pdf | http://www.biocreative.org/tasks/biocreative-iv/chemdner/ | |
| Chemical Patent Corpus | 2014 | standoff | 200 patents | http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0107477 | http://biosemantics.org/index.php/resources/chemical-patent-corpus | |
| CoMAGC | 2013 | XML | 821 sentences on prostate, breast and ovarian cancer | http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-323 | http://biopathway.org/CoMAGC/ | |
| CRAFT | 2012 | | 97 full OA biomedical articles | | http://bionlp-corpora.sourceforge.net/CRAFT/ | |
| Craven (Wisconsin corpus) | 1999 | other | 1,529,731 sentences (automated) | https://www.biostat.wisc.edu/~craven/ie/ReadMe | https://www.biostat.wisc.edu/~craven/ie/ | |
| CTD (BioCreative IV Track 3) | | BioC | | | http://www.biocreative.org/tasks/biocreative-iv/track-3-CTD/ | |
| DDICorpus | 2011, 2013 | BioC | 792 texts from DrugBank and 233 Medline abstracts | https://www.ncbi.nlm.nih.gov/pubmed/23906817 | http://bioc.sourceforge.net/ | http://corpora.informatik.hu-berlin.de/ http://labda.inf.uc3m.es/ddicorpus |
| DIP-PPI (Database of Interacting Proteins) | | other | Only proteins from yeast. | | https://www2.informatik.hu-berlin.de/~hakenber/corpora/ | |
| EBI:diseases | 2008 | other | 856 sentences from 624 abstracts | http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-S3-S3 | https://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html | ftp://ftp.ebi.ac.uk/pub/software/textmining/corpora/diseases |
| eFIP | 2012, 2015 | xlsx | | https://www.ncbi.nlm.nih.gov/pubmed/23221174 https://www.ncbi.nlm.nih.gov/pubmed/25833953 | http://research.bioinformatics.udel.edu/iprolink/corpora.php | |
| EMU (Extractor of Mutations) | 2011 | other | | https://www.ncbi.nlm.nih.gov/pubmed/21138947 | http://bioinf.umbc.edu/EMU/ftp/ | |
| EU-ADR | 2012 | other | 300 PubMed abstracts (drug-disorder, drug-target, gene-disorder, SNP-disorder) | http://www.sciencedirect.com/science/article/pii/S1532046412000573 | http://biosemantics.org/index.php/resources/euadr-corpus | |
| Exhaustive PTM (BioNLP 2011) | | | | http://dl.acm.org/citation.cfm?id=2002902.2002920 | https://github.com/dterg/exhaustive-ptm | |
| FlySlip | 2007 | CONLL | 82 abstracts, 5 full papers | https://www.ncbi.nlm.nih.gov/pubmed/17990496 | http://compbio.ucdenver.edu/ccp/corpora/obtaining.shtml | http://www.wiki.cl.cam.ac.uk/rowiki/NaturalLanguage/FlySlip/Flyslip-resources |
| FSU-PRGE | 2010 | IeXML | 3236 MEDLINE abstracts (35,519 sentences) | http://aclweb.org/anthology/W/W10/W10-1838.pdf | http://www.ebi.ac.uk/Rebholz-srv/CALBC/corpora/corpora.html | |
| GAD | 2015 | csv | | http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0472-9 | http://ibi.imim.es/research-lines/biomedical-text-mining/corpora/ | |
| GeneReg | 2010 | BioC | 314 Abstracts | http://www.lrec-conf.org/proceedings/lrec2010/pdf/407_Paper.pdf | http://corpora.informatik.hu-berlin.de/ | http://www.julielab.de/Resources/GeneReg.html |
| GeneTag (BioCreative II Gene Mention) | 2005 | BioC | 20,000 sentences MEDLINE | https://www.ncbi.nlm.nih.gov/pubmed/15960837 | https://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html | http://bioc.sourceforge.net/ |
| GENIA (BioNLP Shared Task 2009) | | | | | http://www.nactem.ac.uk/tsujii/GENIA/SharedTask/detail.shtml#downloads | |
| GENIA (BioNLP Shared Task 2011) | | BioC, standoff | | | https://sites.google.com/site/bionlpst/home/epigenetics-and-post-translational-modifications | http://2011.bionlp-st.org http://corpora.informatik.hu-berlin.de/ |
| GENIA (term annotation) | 2003 | BioC, XML | | | http://corpora.informatik.hu-berlin.de/ | http://www.nactem.ac.uk/aNT/genia.html |
| GETM | 2010 | BioC, standoff | | http://dl.acm.org/citation.cfm?id=1869970 | http://corpora.informatik.hu-berlin.de/ | http://getm-project.sourceforge.net/ |
| GREC (Gene Regulation Event Corpus) | 2009 | BioC, standoff, XML | 240 MEDLINE (167 on E. coli and 73 on Human) | http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-349 | http://corpora.informatik.hu-berlin.de/ | http://www.nactem.ac.uk/GREC/ |
| HIMERA | 2016 | standoff | | http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0144717 | http://www.nactem.ac.uk/himera/ | |
| HPRD50 (Human Protein Reference Database) | 2004 | BioC | 50 abstracts | https://www.ncbi.nlm.nih.gov/pubmed/14681466 | http://corpora.informatik.hu-berlin.de/ | http://www2.bio.ifi.lmu.de/publications/RelEx/ |
| IDP4+ | 2007 | anndoc | 860 abstracts/full-texts | https://academic.oup.com/bioinformatics/article/33/12/1852/2991428 | https://www.tagtog.net/-corpora/IDP4+ | |
| IEPA | 2002 | BioC | slightly over 300 MEDLINE abstracts | https://www.ncbi.nlm.nih.gov/pubmed/11928487 | http://corpora.informatik.hu-berlin.de/ | http://orbit.nlm.nih.gov/resource/iepa-corpus |
| iHOP | 2004 | other | ~ 160 sentences | https://www.ncbi.nlm.nih.gov/pubmed/15226743 | http://www.ihop-net.org/UniPub/iHOP/info/gene_index/manual/1.html | |
| iProLINK / RLIMS | 2004 | other, XML, BioC | | https://www.ncbi.nlm.nih.gov/pubmed/15556482 | http://research.bioinformatics.udel.edu/iprolink/corpora.php | |
| iSimp | 2014 | BioC | 130 MEDLINE abstracts (1199 sentences) | https://www.ncbi.nlm.nih.gov/pubmed/24850848 | http://research.bioinformatics.udel.edu/isimp/corpus.html | |
| Linnaeus | 2010 | standoff | | http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-85 | https://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html | http://linnaeus.sourceforge.net/ |
| LLL (Learning Language in Logic) | 2005 | BioC | | https://www.cs.york.ac.uk/aig/lll/lll05/lll05-nedellec.pdf | http://corpora.informatik.hu-berlin.de/ | http://genome.jouy.inra.fr/texte/LLLchallenge/ |
| MEDSTRACT | | BioC | 199 PubMed citations | https://www.ncbi.nlm.nih.gov/pubmed/11604766 | http://bioc.sourceforge.net/ | |
| MedTag | 2005 | other | | https://www.researchgate.net/publication/234785358_MedTag_a_collection_of_biomedical_annotations | ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedTag/medtag.tar.gz | https://sourceforge.net/projects/medtag/ |
| Metabolite and Enzyme | 2011 | BioC, XML | 296 abstracts | http://link.springer.com/article/10.1007%2Fs11306-010-0251-6 | http://www.nactem.ac.uk/metabolite-corpus/ | http://argo.nactem.ac.uk/bioc/ |
| miRTex | 2015 | BioC, standoff | 350 abstracts (200 development, 150 test) | http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004391 | http://research.bioinformatics.udel.edu/iprolink/corpora.php | |
| MLEE | 2012 | CONLL, standoff | 262 PubMed abstracts on molecular mechanisms of cancer (specifically relating to angiogenesis) | https://academic.oup.com/bioinformatics/article/28/18/i575/249872/Event-extraction-across-multiple-levels-of | http://nactem.ac.uk/MLEE/ | |
| mTOR pathway event corpus (BioNLP 2011) | 2011 | standoff | | http://dl.acm.org/citation.cfm?id=2002919 | https://github.com/dterg/mtor-pathway/tree/master/original-data | |
| MutationFinder | 2007 | other | 305 abstracts (development data set), 508 abstracts (test set) | https://www.ncbi.nlm.nih.gov/pubmed/17495998 | http://mutationfinder.sourceforge.net/ | https://github.com/rockt/SETH |
| Nagel | | XML, standoff | | | http://sourceforge.net/projects/bionlp-corpora/files/ProteinResidue/ | |
| NCBI Disease | 2012 | other | 6881 sentences in 793 PubMed abstracts | https://www.ncbi.nlm.nih.gov/pubmed/24393765 | http://www.ncbi.nlm.nih.gov/CBBresearch/Fellows/Dogan/disease.html | |
| OMM (Open Mutation Miner) | 2012 | other | 40 full texts | https://www.ncbi.nlm.nih.gov/pubmed/22759648 | http://www.semanticsoftware.info/open-mutation-miner | |
| OSIRIS | 2008 | BioC, XML, standoff | 105 articles | https://www.ncbi.nlm.nih.gov/pubmed/18251998 | http://corpora.informatik.hu-berlin.de/ | https://sites.google.com/site/laurafurlongweb/databases-and-tools/corpora |
| PC (Pathway Curation) (BioNLP-ST 2013) | 2013 | BioC | | | http://argo.nactem.ac.uk/bioc/ | http://2013.bionlp-st.org/tasks/pathway-curation |
| PennBioIE-oncology | 2004 | IeXML | 1414 PubMed abstracts on cancer | http://www.aclweb.org/anthology/W04-3111 | http://www.ebi.ac.uk/Rebholz-srv/CALBC/corpora/corpora.html | |
| pGenN (Plant-GN) | 2015 | BioC | 104 MEDLINE abstracts | http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0135305 | http://research.bioinformatics.udel.edu/iprolink/corpora.php | |
| PICAD | 2011 | XML | 1037 sentences from PubMed | http://dl.acm.org/citation.cfm?doid=2147805.2147853 | http://ani.stat.fsu.edu/~jinfeng/resources/PICAD.txt | http://corpora.informatik.hu-berlin.de/ |
| PolySearch (includes v1. and v2.) | | other | | https://www.ncbi.nlm.nih.gov/pubmed/25925572 | http://polysearch.cs.ualberta.ca/downloads | |
| ProteinResidue | | other | | | http://bionlp-corpora.sourceforge.net/ | |
| SCAI_Klinger | 2008 | CONLL | | https://academic.oup.com/bioinformatics/article/24/13/i268/235854/Detection-of-IUPAC-and-IUPAC-like-chemical-names | https://www.scai.fraunhofer.de/en/business-research-areas/bioinformatics/downloads/corpora-for-chemical-entity-recognition.html | |
| SCAI_Kolarik | 2008 | CONLL | | http://www.lrec-conf.org/proceedings/lrec2008/workshops/W4_Proceedings.pdf#page=55 | https://www.scai.fraunhofer.de/en/business-research-areas/bioinformatics/downloads/corpora-for-chemical-entity-recognition.html | |
| SETH | 2016 | standoff | 630 publications from The American Journal of Human Genetics and Human Mutation | https://www.ncbi.nlm.nih.gov/pubmed/?term=27256315 | https://github.com/rockt/SETH/tree/master/resources/SETH-corpus | |
| SH (Schwartz and Hearst) | 2003 | BioC | 1000 PubMed Abstracts | https://www.ncbi.nlm.nih.gov/pubmed/12603049 | http://bioc.sourceforge.net/ | |
| SNPCorpus | 2011 | BioC | 296 MEDLINE abstracts | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3194196/ | http://corpora.informatik.hu-berlin.de/ | http://www.scai.fraunhofer.de/snp-normalization-corpus.html |
| Species | 2013 | standoff | 800 PubMed abstracts | http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0065390 | http://species.jensenlab.org/ | http://species.jensenlab.org/ |
| T4SS (Type 4 Secretion System) | 2011 | CONLL | | http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0014780 | http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0014780 | |
| T4SS Event Extraction (BioNLP 2010) | 2010 | other | | http://dl.acm.org/citation.cfm?id=1869961.1869980 | https://github.com/dterg/t4ss-event | |
| tmVar | 2013 | BioC | 500 PubMed abstracts | https://www.ncbi.nlm.nih.gov/pubmed/23564842 | https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/#tmVar | https://github.com/rockt/SETH |
| VariomeCorpus (hvp) | 2013 | BioC | | https://www.ncbi.nlm.nih.gov/pubmed/23584833 | http://corpora.informatik.hu-berlin.de/ | http://www.opennicta.com/home/health/variome |
| Yapex | 2002 | other | 99 training, 101 test MEDLINE abstracts | https://www.ncbi.nlm.nih.gov/pubmed/12460631 | http://www.rostlab.org/~nlprot/yapex.txt | https://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html |
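
One way the table above might be used is to shortlist corpora by annotation format before downloading anything. The sketch below is only an illustration: the `Corpus` record, the `with_format` helper, and the four hand-copied rows are assumptions made for this example and are not part of the repository.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Corpus:
    """Hypothetical record holding a few columns of one table row."""
    name: str
    year: str
    formats: Tuple[str, ...]
    download_url: str


# Hand-copied subset of the table above (illustrative, not the full list).
CORPORA = [
    Corpus("AnatEM", "2013", ("CONLL", "standoff"),
           "http://nactem.ac.uk/anatomytagger/#AnatEM"),
    Corpus("BioInfer", "2007", ("BioC",),
           "http://corpora.informatik.hu-berlin.de/"),
    Corpus("MLEE", "2012", ("CONLL", "standoff"),
           "http://nactem.ac.uk/MLEE/"),
    Corpus("tmVar", "2013", ("BioC",),
           "https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/#tmVar"),
]


def with_format(corpora: List[Corpus], fmt: str) -> List[Corpus]:
    """Return the corpora that list `fmt` among their formats (case-insensitive)."""
    wanted = fmt.casefold()
    return [c for c in corpora if any(f.casefold() == wanted for f in c.formats)]


if __name__ == "__main__":
    # Print the name and download location of every CONLL-formatted corpus.
    for corpus in with_format(CORPORA, "conll"):
        print(corpus.name, corpus.download_url)
```

Matching is case-insensitive because the table mixes spellings such as "txt", "xlsx" and "CONLL"; rows that list several formats would need to be split on commas (or "/") before being loaded into such a structure.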