article_dataset_builder
Issue with harvest_dois
I'm running into an issue where harvest_pmcids works but harvest_dois does not. For pmcids the PDFs are gathered, but for harvest_dois they are not.
I first ran into this with arXiv DOIs, but then I tried the DOIs in the test folder of this project and saw the same thing.
The symptom is that harvester.diagnostic(full=True) shows "total invalid PDF: 7" when I run with the test DOIs.
Any chance that something is broken in the doi list approach, but not in the pmcids approach?
Hi @jameshowison !
The reason is that arXiv DOIs are not CrossRef DOIs but DataCite DOIs. This module only resolves CrossRef ones, so it finds 0 PDFs. This is the problem with the many new DOI providers and the fact that preprint services now use these free DOIs.
I made something specific for arXiv, https://github.com/kermitt2/arxiv_harvester, but it is meant for creating a full arXiv mirror, not for fetching just a few arXiv PDFs.
Hmmm. Two things then:
- The DOIs in https://github.com/kermitt2/article_dataset_builder/blob/master/test/dois.txt are also not working for me. Those aren't arXiv DOIs, are they?
- Where should the documentation mention the issue with non-CrossRef DOIs? Maybe the method should be renamed harvest_crossref_dois? Is there some way to detect DOIs that the module can't obtain?
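On the detection question, one possible approach is to ask the doi.org registration agency lookup (https://doi.org/doiRA/<doi>) which agency registered each DOI before harvesting. This is only a sketch and not part of article_dataset_builder: the function names are hypothetical, and the response shape (a JSON list of objects with "DOI" and "RA" keys) is my understanding of that service rather than something confirmed in this thread.

```python
import json
import urllib.parse
import urllib.request


def registration_agency(ra_records, doi):
    """Return the agency name for `doi` from a parsed doiRA response, or None.

    `ra_records` is assumed to be a list of dicts with "DOI" and "RA" keys.
    DOIs are compared case-insensitively.
    """
    for record in ra_records:
        if record.get("DOI", "").lower() == doi.lower():
            return record.get("RA")
    return None


def is_crossref_doi(ra_records, doi):
    """True only when the lookup reports Crossref as the registration agency."""
    agency = registration_agency(ra_records, doi)
    return agency is not None and agency.lower() == "crossref"


def fetch_ra_records(doi):
    """Query the doi.org lookup for one DOI (network call, no retries)."""
    url = "https://doi.org/doiRA/" + urllib.parse.quote(doi, safe="/")
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)
```

A DOI list could then be split up front, e.g. `is_crossref_doi(fetch_ra_records(doi), doi)`, so that DataCite DOIs are reported as unsupported instead of ending up counted as invalid PDFs.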
It looks like the arXiv DOIs do work using arxiv_base from the config.harvester file if you strip everything up to and including arxiv. from the front of the DOI. E.g.
doi:10.48550/arxiv.1808.06161
works to get direct PDF via
https://arxiv.org/pdf/1808.06161
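The mapping above can be sketched as a small helper. This is a minimal sketch, not code from the module: `arxiv_pdf_url` is a hypothetical name, and the default base URL is taken from the example link above rather than read from the actual arxiv_base setting.

```python
import re

# Matches arXiv DataCite DOIs such as "doi:10.48550/arxiv.1808.06161".
# The "doi:" scheme prefix is optional and matching is case-insensitive.
ARXIV_DOI_PATTERN = re.compile(r"^(?:doi:)?10\.48550/arxiv\.(.+)$", re.IGNORECASE)

# Assumed value, copied from the example URL above.
ARXIV_BASE = "https://arxiv.org/pdf"


def arxiv_pdf_url(doi, arxiv_base=ARXIV_BASE):
    """Return the direct PDF URL for an arXiv DOI, or None for any other DOI."""
    match = ARXIV_DOI_PATTERN.match(doi.strip())
    if match is None:
        return None
    # The captured group is the bare arXiv identifier, e.g. "1808.06161".
    return f"{arxiv_base.rstrip('/')}/{match.group(1)}"
```

Returning None for non-arXiv DOIs lets a caller fall back to the normal CrossRef path for everything else.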