biomedical_corpora
This table lists biomedical corpora available for named entity recognition (some are also suitable for association detection). It was published as part of the paper:
Dieter Galea, Ivan Laponogov, Kirill Veselkov; Exploiting and assessing multi-source data for supervised biomedical named entity recognition, Bioinformatics, bty152, https://doi.org/10.1093/bioinformatics/bty152
If you know of other relevant corpora, please submit a pull request and I'll happily approve it.
Corpus | Year | Format | Documents | Original Publication | Downloaded From | Other URLs |
---|---|---|---|---|---|---|
Ab3P (Abbreviation Plus P-Precision) | 2008 | BioC | 1250 PubMed Abstracts | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2576267/ | http://bioc.sourceforge.net/ | |
AIMed | 2005 | BioC | ~ 1000 MEDLINE abstracts (200 abstracts) | http://www.sciencedirect.com/science/article/pii/S0933365704001319 | http://corpora.informatik.hu-berlin.de/ | http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.3218&rep=rep1&type=pdf |
AnatEM (Anatomical entity mention recognition) | 2013 | CONLL, standoff | 1212 docs (500 docs from AnEM + 262 from MLEE + 450 others) | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3957068/ | http://nactem.ac.uk/anatomytagger/#AnatEM | |
AnEM | 2012 | BioC | 500 docs (PubMed and PMC); abstracts and full text drawn randomly | http://www.nactem.ac.uk/anatomy/docs/ohta2012opendomain.pdf | http://corpora.informatik.hu-berlin.de/ | |
AZDC (Arizona Disease Corpus) | 2009 | IeXML, .txt | 2856 PubMed abstracts (2775 sentences). Other source says 794 PubMed Abstracts | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2352871/ | http://www.ebi.ac.uk/Rebholz-srv/CALBC/corpora/IeXML/goldcorpus/azdc-1.xml | http://diego.asu.edu/downloads/AZDC_6-26-2009.txt |
BEL (BioCreative V5 BEL Track) | 2016 | BioC | | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4995071/ | https://wiki.openbel.org/display/BIOC/Datasets | |
BioADI | 2009 | BioC | 1201 PubMed abstracts | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2788358/ | http://bioc.sourceforge.net/ | |
BioCause | 2013 | standoff | 19 full-text documents | http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-2 | http://www.nactem.ac.uk/biocause/download.php | |
BioCreative-PPI | | XML | | | https://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html | |
BioGRID | 2017 | BioC | 120 full text articles | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5225395/ | http://bioc.sourceforge.net/BioC-BioGRID.html | |
BioInfer | 2007 | BioC | 1100 sentences from biomedical literature | http://www.biomedcentral.com/1471-2105/8/50 | http://corpora.informatik.hu-berlin.de/ | http://mars.cs.utu.fi/BioInfer |
BioMedLat | 2016 | standoff | 643 BioASQ questions/factoids | https://www.semanticscholar.org/paper/BioMedLAT-Corpus-Annotation-of-the-Lexical-Answer-Neves-Kraus/b0f09f94015771c31bd2483efdd8f0f86996384e | https://github.com/mariananeves/BioMedLAT | |
BioText | 2004 | txt | 100 titles and 40 abstracts | http://biotext.berkeley.edu/papers/acl04-relations.pdf | https://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html | |
CDR (BioCreative V) | | BioC | | | http://bioc.sourceforge.net/ | |
CellFinder 1.0 | 2012 | BioC | 10 full documents from PMC from (Loser et al. 2009) on "Human Embryonic Stem Cell Lines and Their Use in International Research" | http://www.nactem.ac.uk/biotxtm2012/presentations/Neves-pres.pdf | http://corpora.informatik.hu-berlin.de/ | http://cellfinder.de/about/annotation/ |
CG Cancer-Genetics (BioNLP-ST 2013) | 2013 | BioC, standoff | | http://aclweb.org/anthology/W/W13/W13-2008.pdf | http://2013.bionlp-st.org/tasks/cancer-genetics | |
CHEMDNER (BioCreative IV Track 2) | 2013 | BioC, standoff | | http://www.biocreative.org/media/store/files/2013/bc4_v2_1.pdf | http://www.biocreative.org/tasks/biocreative-iv/chemdner/ | |
Chemical Patent Corpus | 2014 | standoff | 200 patents | http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0107477 | http://biosemantics.org/index.php/resources/chemical-patent-corpus | |
CoMAGC | 2013 | XML | 821 sentences on prostate, breast and ovarian cancer | http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-323 | http://biopathway.org/CoMAGC/ | |
CRAFT | 2012 | | 97 full OA biomedical articles | | http://bionlp-corpora.sourceforge.net/CRAFT/ | |
Craven (Wisconsin corpus) | 1999 | other | 1,529,731 sentences (automated) | https://www.biostat.wisc.edu/~craven/ie/ReadMe | https://www.biostat.wisc.edu/~craven/ie/ | |
CTD (BioCreative IV Track 3) | | BioC | | | http://www.biocreative.org/tasks/biocreative-iv/track-3-CTD/ | |
DDICorpus | 2011 2013 | BioC | 792 texts from DrugBank and 233 Medline abstracts | https://www.ncbi.nlm.nih.gov/pubmed/23906817 | http://bioc.sourceforge.net/ http://corpora.informatik.hu-berlin.de/ | http://labda.inf.uc3m.es/ddicorpus |
DIP-PPI (Database of Interacting Proteins) | | other | Only proteins from yeast. | | https://www2.informatik.hu-berlin.de/~hakenber/corpora/ | |
EBI:diseases | 2008 | other | 856 sentences from 624 abstracts | http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-S3-S3 | https://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html | ftp://ftp.ebi.ac.uk/pub/software/textmining/corpora/diseases |
eFIP | 2012 2015 | xlsx | | https://www.ncbi.nlm.nih.gov/pubmed/23221174 https://www.ncbi.nlm.nih.gov/pubmed/25833953 | http://research.bioinformatics.udel.edu/iprolink/corpora.php | |
EMU (Extractor of Mutations) | 2011 | other | | https://www.ncbi.nlm.nih.gov/pubmed/21138947 | http://bioinf.umbc.edu/EMU/ftp/ | |
EU-ADR | 2012 | other | 300 PubMed abstracts (drug-disorder, drug-target, gene-disorder, SNP-disorder) | http://www.sciencedirect.com/science/article/pii/S1532046412000573 | http://biosemantics.org/index.php/resources/euadr-corpus | |
Exhaustive PTM (BioNLP 2011) | | | | http://dl.acm.org/citation.cfm?id=2002902.2002920 | https://github.com/dterg/exhaustive-ptm | |
FlySlip | 2007 | CONLL | 82 abstracts, 5 full papers | https://www.ncbi.nlm.nih.gov/pubmed/17990496 | http://compbio.ucdenver.edu/ccp/corpora/obtaining.shtml | http://www.wiki.cl.cam.ac.uk/rowiki/NaturalLanguage/FlySlip/Flyslip-resources |
FSU-PRGE | 2010 | IeXML | 3236 MEDLINE abstracts (35,519 sentences) | http://aclweb.org/anthology/W/W10/W10-1838.pdf | http://www.ebi.ac.uk/Rebholz-srv/CALBC/corpora/corpora.html | |
GAD | 2015 | csv | | http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0472-9 | http://ibi.imim.es/research-lines/biomedical-text-mining/corpora/ | |
GeneReg | 2010 | BioC | 314 Abstracts | http://www.lrec-conf.org/proceedings/lrec2010/pdf/407_Paper.pdf | http://corpora.informatik.hu-berlin.de/ | http://www.julielab.de/Resources/GeneReg.html |
GeneTag (BioCreative II Gene Mention) | 2005 | BioC | 20,000 sentences MEDLINE | https://www.ncbi.nlm.nih.gov/pubmed/15960837 | https://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html http://bioc.sourceforge.net/ | |
GENIA (BioNLP Shared Task 2009) | | | | | http://www.nactem.ac.uk/tsujii/GENIA/SharedTask/detail.shtml#downloads | |
GENIA (BioNLP Shared Task 2011) | | BioC, standoff | | https://sites.google.com/site/bionlpst/home/epigenetics-and-post-translational-modifications http://2011.bionlp-st.org | http://corpora.informatik.hu-berlin.de/ | |
GENIA (term annotation) | 2003 | BioC, XML | | | http://corpora.informatik.hu-berlin.de/ | http://www.nactem.ac.uk/aNT/genia.html |
GETM | 2010 | BioC, standoff | | http://dl.acm.org/citation.cfm?id=1869970 | http://corpora.informatik.hu-berlin.de/ | http://getm-project.sourceforge.net/ |
GREC (Gene Regulation Event Corpus) | 2009 | BioC, standoff, XML | 240 MEDLINE abstracts (167 on E. coli and 73 on human) | http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-349 | http://corpora.informatik.hu-berlin.de/ | http://www.nactem.ac.uk/GREC/ |
HIMERA | 2016 | standoff | | http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0144717 | http://www.nactem.ac.uk/himera/ | |
HPRD50 (Human Protein Reference Database) | 2004 | BioC | 50 abstracts | https://www.ncbi.nlm.nih.gov/pubmed/14681466 | http://corpora.informatik.hu-berlin.de/ | http://www2.bio.ifi.lmu.de/publications/RelEx/ |
IDP4+ | 2017 | anndoc | 860 abstracts/full-texts | https://academic.oup.com/bioinformatics/article/33/12/1852/2991428 | https://www.tagtog.net/-corpora/IDP4+ | |
IEPA | 2002 | BioC | slightly over 300 MEDLINE abstracts | https://www.ncbi.nlm.nih.gov/pubmed/11928487 | http://corpora.informatik.hu-berlin.de/ | http://orbit.nlm.nih.gov/resource/iepa-corpus |
iHOP | 2004 | other | ~ 160 sentences | https://www.ncbi.nlm.nih.gov/pubmed/15226743 | http://www.ihop-net.org/UniPub/iHOP/info/gene_index/manual/1.html | |
iProLINK / RLIMS | 2004 | other, XML, BioC | | https://www.ncbi.nlm.nih.gov/pubmed/15556482 | http://research.bioinformatics.udel.edu/iprolink/corpora.php | |
iSimp | 2014 | BioC | 130 MEDLINE abstracts (1199 sentences) | https://www.ncbi.nlm.nih.gov/pubmed/24850848 | http://research.bioinformatics.udel.edu/isimp/corpus.html | |
Linnaeus | 2010 | standoff | | http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-85 | https://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html | http://linnaeus.sourceforge.net/ |
LLL (Learning Language in Logic) | 2005 | BioC | | https://www.cs.york.ac.uk/aig/lll/lll05/lll05-nedellec.pdf | http://corpora.informatik.hu-berlin.de/ | http://genome.jouy.inra.fr/texte/LLLchallenge/ |
MEDSTRACT | | BioC | 199 PubMed citations | https://www.ncbi.nlm.nih.gov/pubmed/11604766 | http://bioc.sourceforge.net/ | |
MedTag | 2005 | other | | https://www.researchgate.net/publication/234785358_MedTag_a_collection_of_biomedical_annotations | ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedTag/medtag.tar.gz https://sourceforge.net/projects/medtag/ | |
Metabolite and Enzyme | 2011 | BioC, XML | 296 abstracts | http://link.springer.com/article/10.1007%2Fs11306-010-0251-6 | http://www.nactem.ac.uk/metabolite-corpus/ | http://argo.nactem.ac.uk/bioc/ |
miRTex | 2015 | BioC, standoff | 350 abstracts (200 development, 150 test) | http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004391 | http://research.bioinformatics.udel.edu/iprolink/corpora.php | |
MLEE | 2012 | CONLL, standoff | 262 PubMed abstracts on molecular mechanisms of cancer (specifically relating to angiogenesis) | https://academic.oup.com/bioinformatics/article/28/18/i575/249872/Event-extraction-across-multiple-levels-of | http://nactem.ac.uk/MLEE/ | |
mTOR pathway event corpus (BioNLP 2011) | 2011 | standoff | | http://dl.acm.org/citation.cfm?id=2002919 | https://github.com/dterg/mtor-pathway/tree/master/original-data | |
MutationFinder | 2007 | other | 305 abstracts (development set), 508 abstracts (test set) | https://www.ncbi.nlm.nih.gov/pubmed/17495998 | http://mutationfinder.sourceforge.net/ | https://github.com/rockt/SETH |
Nagel | | XML, standoff | | | http://sourceforge.net/projects/bionlp-corpora/files/ProteinResidue/ | |
NCBI Disease | 2012 | other | 6881 sentences in 793 PubMed abstracts | https://www.ncbi.nlm.nih.gov/pubmed/24393765 | http://www.ncbi.nlm.nih.gov/CBBresearch/Fellows/Dogan/disease.html | |
OMM (Open Mutation Miner) | 2012 | other | 40 full texts | https://www.ncbi.nlm.nih.gov/pubmed/22759648 | http://www.semanticsoftware.info/open-mutation-miner | |
OSIRIS | 2008 | BioC, XML, standoff | 105 articles | https://www.ncbi.nlm.nih.gov/pubmed/18251998 | http://corpora.informatik.hu-berlin.de/ | https://sites.google.com/site/laurafurlongweb/databases-and-tools/corpora |
PC (Pathway Curation) (BioNLP-ST 2013) | 2013 | BioC | | | http://argo.nactem.ac.uk/bioc/ | http://2013.bionlp-st.org/tasks/pathway-curation |
PennBioIE-oncology | 2004 | IeXML | 1414 PubMed abstracts on cancer | http://www.aclweb.org/anthology/W04-3111 | http://www.ebi.ac.uk/Rebholz-srv/CALBC/corpora/corpora.html | |
pGenN (Plant-GN) | 2015 | BioC | 104 MEDLINE abstracts | http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0135305 | http://research.bioinformatics.udel.edu/iprolink/corpora.php | |
PICAD | 2011 | XML | 1037 sentences from PubMed | http://dl.acm.org/citation.cfm?doid=2147805.2147853 | http://ani.stat.fsu.edu/~jinfeng/resources/PICAD.txt | http://corpora.informatik.hu-berlin.de/ |
PolySearch (includes v1. and v2.) | | other | | https://www.ncbi.nlm.nih.gov/pubmed/25925572 | http://polysearch.cs.ualberta.ca/downloads | |
ProteinResidue | | other | | | http://bionlp-corpora.sourceforge.net/ | |
SCAI_Klinger | 2008 | CONLL | | https://academic.oup.com/bioinformatics/article/24/13/i268/235854/Detection-of-IUPAC-and-IUPAC-like-chemical-names | https://www.scai.fraunhofer.de/en/business-research-areas/bioinformatics/downloads/corpora-for-chemical-entity-recognition.html | |
SCAI_Kolarik | 2008 | CONLL | | http://www.lrec-conf.org/proceedings/lrec2008/workshops/W4_Proceedings.pdf#page=55 | https://www.scai.fraunhofer.de/en/business-research-areas/bioinformatics/downloads/corpora-for-chemical-entity-recognition.html | |
SETH | 2016 | standoff | 630 publications from The American Journal of Human Genetics and Human Mutation | https://www.ncbi.nlm.nih.gov/pubmed/?term=27256315 | https://github.com/rockt/SETH/tree/master/resources/SETH-corpus | |
SH (Schwartz and Hearst) | 2003 | BioC | 1000 PubMed Abstracts | https://www.ncbi.nlm.nih.gov/pubmed/12603049 | http://bioc.sourceforge.net/ | |
SNPCorpus | 2011 | BioC | 296 MEDLINE abstracts | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3194196/ | http://corpora.informatik.hu-berlin.de/ | http://www.scai.fraunhofer.de/snp-normalization-corpus.html |
Species | 2013 | standoff | 800 PubMed abstracts | http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0065390 | http://species.jensenlab.org/ | http://species.jensenlab.org/ |
T4SS (Type 4 Secretion System) | 2011 | CONLL | | http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0014780 | http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0014780 | |
T4SS Event Extraction (BioNLP 2010) | 2010 | other | | http://dl.acm.org/citation.cfm?id=1869961.1869980 | https://github.com/dterg/t4ss-event | |
tmVar | 2013 | BioC | 500 PubMed abstracts | https://www.ncbi.nlm.nih.gov/pubmed/23564842 | https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/#tmVar | https://github.com/rockt/SETH |
VariomeCorpus (hvp) | 2013 | BioC | | https://www.ncbi.nlm.nih.gov/pubmed/23584833 | http://corpora.informatik.hu-berlin.de/ | http://www.opennicta.com/home/health/variome |
Yapex | 2002 | other | 99 training, 101 test MEDLINE abstracts | https://www.ncbi.nlm.nih.gov/pubmed/12460631 | http://www.rostlab.org/~nlprot/yapex.txt | https://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html |
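
To pick out corpora by annotation format, the table rows can be parsed programmatically. The following is a minimal sketch, not part of the original repository: the helper names `parse_row` and `corpora_with_format` are hypothetical, and the code assumes each row carries exactly seven pipe-delimited cells in the header's column order. Rows with a different cell count raise an error rather than being silently misread.

```python
from typing import Dict, List

# Column names taken from the table header above.
COLUMNS = ["Corpus", "Year", "Format", "Documents",
           "Original Publication", "Downloaded From", "Other URLs"]

# Three rows copied from the table, used here only as sample input.
SAMPLE_ROWS = [
    "BioADI | 2009 | BioC | 1201 PubMed abstracts | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2788358/ | http://bioc.sourceforge.net/ | |",
    "BioCause | 2013 | standoff | 19 full-text documents | http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-2 | http://www.nactem.ac.uk/biocause/download.php | |",
    "MLEE | 2012 | CONLL, standoff | 262 PubMed abstracts on molecular mechanisms of cancer (specifically relating to angiogenesis) | https://academic.oup.com/bioinformatics/article/28/18/i575/249872/Event-extraction-across-multiple-levels-of | http://nactem.ac.uk/MLEE/ | |",
]


def parse_row(line: str) -> Dict[str, str]:
    """Split one pipe-delimited table row into a column -> value mapping."""
    cells = [cell.strip() for cell in line.strip().split("|")]
    # A trailing "|" leaves one empty string after the last cell; drop it.
    if len(cells) == len(COLUMNS) + 1 and cells[-1] == "":
        cells = cells[:-1]
    if len(cells) != len(COLUMNS):
        raise ValueError(
            f"expected {len(COLUMNS)} cells, got {len(cells)}: {line!r}")
    return dict(zip(COLUMNS, cells))


def corpora_with_format(rows: List[Dict[str, str]], fmt: str) -> List[str]:
    """Return names of corpora whose Format cell lists `fmt` (case-insensitive)."""
    wanted = fmt.lower()
    names = []
    for row in rows:
        # Format cells may list several formats separated by "," or "/".
        formats = [f.strip().lower()
                   for f in row["Format"].replace("/", ",").split(",")]
        if wanted in formats:
            names.append(row["Corpus"])
    return names


rows = [parse_row(line) for line in SAMPLE_ROWS]
print(corpora_with_format(rows, "standoff"))  # ['BioCause', 'MLEE']
```

Splitting on `|` is safe here only because none of the cells contain a pipe character; a full Markdown parser would be the more robust choice for arbitrary tables.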