
pubmed dumps

Open sckott opened this issue 6 years ago • 1 comments

Due to various issues with using the NCBI Entrez API (e.g., #257), we could explore using PubMed dumps instead of calling their API.

  • the ftp dump page https://www.ncbi.nlm.nih.gov/pmc/tools/ftp/ - NOPE, see below

I think the link above is just the OA (open access) subset; I'm not sure exactly how many papers that is. I don't think that's what we'd need, but maybe it is.

I guess the drawback of this approach is that we'd need to build our own search on top of the data. Could we index it all with Solr/Elasticsearch, then call that internal API?
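To make the indexing idea concrete, here is a minimal sketch of turning parsed records into the newline-delimited JSON body that Elasticsearch's bulk endpoint accepts (one action line followed by one document line per record). The index name `pubmed`, the record fields, and the helper name `to_bulk_ndjson` are all hypothetical; actually sending the payload to a cluster is left out.

```python
import json


def to_bulk_ndjson(records, index_name="pubmed"):
    """Build an Elasticsearch bulk-API body from an iterable of dicts.

    Each record must have a "pmid" key, which is used as the document id.
    `index_name` and the record layout are assumptions for illustration.
    """
    lines = []
    for rec in records:
        # Action line: tells Elasticsearch where to index the next line.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": rec["pmid"]}}))
        # Source line: the document itself.
        lines.append(json.dumps(rec))
    # The bulk format requires every line, including the last, to end with "\n".
    return "".join(line + "\n" for line in lines)


# Hypothetical records, as they might come out of a dump parser.
records = [
    {"pmid": "1", "title": "First example title"},
    {"pmid": "2", "title": "Second title"},
]
payload = to_bulk_ndjson(records)
```

The payload would then be POSTed to the cluster's `_bulk` endpoint with an `application/x-ndjson` content type; a Solr setup would need a different document format.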

sckott, Nov 12 '19 18:11

  • the metadata we want is at https://www.nlm.nih.gov/databases/download/pubmed_medline.html
  • the XML can be slow to parse; Bryan suggested some tricks (in Python), see DMs
  • maybe metapub: metapub.pubmedarticle.PubMedArticle is a MEDLINE XML parser (Python)
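On the slow-parsing point, one common approach (not necessarily the tricks mentioned above) is to stream the file with the standard library's `xml.etree.ElementTree.iterparse` and discard each record once it has been read, so memory stays flat. The sketch below assumes the usual `PubmedArticleSet` / `PubmedArticle` / `MedlineCitation` layout and pulls out only the PMID and title; the function name and the inline sample are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_pubmed_records(source):
    """Yield (pmid, title) for each PubmedArticle in a file-like object.

    Streams the document instead of loading it whole; `source` can be any
    binary file object (for example one returned by gzip.open).
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "PubmedArticle":
            pmid = elem.findtext("MedlineCitation/PMID")
            title_el = elem.find("MedlineCitation/Article/ArticleTitle")
            # Titles may contain inline markup such as <i>, so join all text.
            title = "".join(title_el.itertext()) if title_el is not None else None
            yield pmid, title
            root.clear()  # drop already-processed records from memory


# Tiny made-up sample in the shape of a PubMed dump file.
SAMPLE = b"""<?xml version="1.0"?>
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">1</PMID>
      <Article><ArticleTitle>First <i>example</i> title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">2</PMID>
      <Article><ArticleTitle>Second title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

parsed = list(iter_pubmed_records(io.BytesIO(SAMPLE)))
```

For a real dump file, the same generator could be fed a handle from `gzip.open(path, "rb")`, where `path` is wherever the downloaded file lives.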

sckott, Nov 12 '19 20:11