Headstart
pubmed dumps
Due to various issues with using the NCBI Entrez API (e.g., #257), we could explore using PubMed dumps instead of calling their API.
- the ftp dump page https://www.ncbi.nlm.nih.gov/pmc/tools/ftp/ - NOPE, see below
I think the above link is just the OA subset; I'm not sure exactly how many papers that is. I don't think that's what we'd need, but maybe it is.
I guess the drawback of this approach is that we'd need to build our own search on top of the data. We could index it all with Solr/Elasticsearch, then call that internal API?
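To make the indexing idea a bit more concrete, here is a rough sketch of turning parsed records into the newline-delimited body that Elasticsearch's `_bulk` endpoint accepts (one action line, then one document line, per record). The index name `pubmed`, the record fields, and the helper name `to_bulk_ndjson` are all made up for illustration; Solr would need a different payload.

```python
import json


def to_bulk_ndjson(records, index_name="pubmed"):
    """Serialize records into a newline-delimited body for the _bulk endpoint.

    `index_name` and the record layout are assumptions for this sketch.
    """
    lines = []
    for record in records:
        # Action line: index this document, using the PMID as the document id.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": record["pmid"]}}))
        # Source line: the document itself.
        lines.append(json.dumps(record))
    # Every line, including the last one, must be terminated by a newline.
    return "".join(line + "\n" for line in lines)


# Hypothetical records, standing in for whatever we extract from the dumps.
records = [
    {"pmid": "1", "title": "A made-up title"},
    {"pmid": "2", "title": "Another made-up title"},
]
body = to_bulk_ndjson(records)
```

The resulting `body` would be POSTed to `/_bulk` with `Content-Type: application/x-ndjson`; in practice we'd send it in batches rather than one giant request.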
- the metadata we want is at https://www.nlm.nih.gov/databases/download/pubmed_medline.html
- the XML can be slow to parse; Bryan suggested some tricks (in Python), see DMs
- maybe metapub: metapub.pubmedarticle.PubMedArticle, a MedLine XML parser (Python)
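On the slow XML parsing: one common approach (not necessarily what Bryan suggested) is to stream the file with the standard library's `iterparse` and clear each record once it has been read, so memory stays roughly flat. A minimal sketch, assuming the dump files follow the usual `PubmedArticleSet` / `PubmedArticle` / `MedlineCitation` layout; the function name, the output fields, and the tiny inline sample are made up for illustration.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def _text(elem, path):
    """Return the full text of the first match, including inline markup like <i>."""
    node = elem.find(path)
    return "".join(node.itertext()).strip() if node is not None else None


def iter_pubmed_records(source):
    """Yield one small dict per <PubmedArticle> without building the whole tree.

    `source` is a filename or a binary file object, e.g. gzip.open(path, "rb")
    for the compressed dump files.
    """
    # With events=("end",), each element is delivered once its closing tag is read.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        parts = elem.findall("MedlineCitation/Article/Abstract/AbstractText")
        abstract = " ".join("".join(p.itertext()).strip() for p in parts)
        record = {
            "pmid": _text(elem, "MedlineCitation/PMID"),
            "title": _text(elem, "MedlineCitation/Article/ArticleTitle"),
            "abstract": abstract or None,
        }
        # Drop the finished record's subtree so it can be garbage-collected;
        # only an empty element shell stays attached to the root.
        elem.clear()
        yield record


# Tiny made-up sample in the assumed layout, so the sketch runs without a download.
SAMPLE = b"""<?xml version="1.0"?>
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">1</PMID>
      <Article>
        <ArticleTitle>A <i>made-up</i> title</ArticleTitle>
        <Abstract>
          <AbstractText>First part.</AbstractText>
          <AbstractText>Second part.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">2</PMID>
      <Article><ArticleTitle>No abstract here</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

records = list(iter_pubmed_records(io.BytesIO(SAMPLE)))
```

For a real dump file, the call would look like `iter_pubmed_records(gzip.open(path, "rb"))`, with `path` pointing at one of the downloaded `.xml.gz` files. If this is still too slow, `lxml` offers a similar `iterparse` and may be faster, but I haven't benchmarked it here.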