hepcrawl
hepcrawl: add crawler for OSTI
* use API at OSTI to harvest records associated with SLAC
Signed-off-by: Thorsten Schwander [email protected]
Description
This adds a LastRunSpider to crawl OSTI for records with SLAC association. The purpose is to satisfy an institutional mandate of having all SLAC HEP research represented in Inspire. Not all SLAC research output is on arXiv or other customarily harvested channels. OSTI is an additional channel to check.
Related Issue
Motivation and Context
Checklist:
- [x] I have all the information that I need (if not, move to RFC and look for it).
- [ ] I linked the related issue(s) in the corresponding commit logs.
- [x] I wrote good commit log messages.
- [ ] My code follows the code style of this project.
- [ ] I've added any new docs if API/utils methods were added.
- [ ] I have updated the existing documentation accordingly.
- [x] I have added tests to cover my changes.
- [x] All new and existing tests passed.
very good comments @michamos thanks
right, I agree that schema_utils shouldn't deal with encoding issues -- which means there will be some sanitizing of random input in the crawler. It's not like the remote end serves stuff in a consistent encoding, it's random crap in the remote metadata -- so the crawler should understand the quirks of the source.
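To make the "sanitizing in the crawler" idea concrete, here is a rough sketch of what such a step could look like. The helper name `to_unicode` and the fallback order (UTF-8, then cp1252) are assumptions for illustration, not what this PR actually does:

```python
def to_unicode(raw, encodings=("utf-8", "cp1252")):
    """Hypothetical helper: decode remote bytes, trying encodings in order.

    The encoding list is a guess about what a source might send;
    a real crawler would tune it to the quirks it has seen.
    """
    if not isinstance(raw, bytes):
        # Already text, nothing to decode.
        return raw
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep what can be decoded, mark the rest as U+FFFD.
    return raw.decode("utf-8", errors="replace")


utf8_title = to_unicode(b"caf\xc3\xa9")    # valid UTF-8
legacy_title = to_unicode(b"caf\xe9")      # not UTF-8, falls back to cp1252
```

Keeping this next to the spider means schema_utils only ever sees text, and each source's quirks stay in the crawler that knows about them.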
on the other hand you advocate for collaboration splitting and normalization in the utils, but then there is no deduping !?
if the input data has `Virgo collaboration; Ligo collaboration; Virgo and Ligo collaborations` then the collaborations end up replicated
So I think LiteratureBuilder should ensure deduping of lists like `collection` and `collaborations` among others. That's beyond this PR, though.
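For illustration, a minimal sketch of the replication and an order-preserving dedupe. Both `split_collaborations` and `dedupe` are hypothetical stand-ins; the real splitting and normalization rules in the utils may differ:

```python
import re


def split_collaborations(raw):
    """Hypothetical splitter: break on ';' and the word 'and',
    then drop the 'collaboration(s)' suffix."""
    names = []
    for chunk in re.split(r";|\band\b", raw):
        name = re.sub(r"\bcollaborations?\b", "", chunk, flags=re.IGNORECASE)
        name = name.strip()
        if name:
            names.append(name)
    return names


def dedupe(items):
    """Drop repeats, keeping first-seen order (case-insensitive match)."""
    seen = set()
    unique = []
    for item in items:
        key = item.lower()
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


raw = "Virgo collaboration; Ligo collaboration; Virgo and Ligo collaborations"
split = split_collaborations(raw)   # ['Virgo', 'Ligo', 'Virgo', 'Ligo']
deduped = dedupe(split)             # ['Virgo', 'Ligo']
```

Splitting alone gives four entries for two collaborations; the dedupe pass is what a builder-level check would have to add.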
I don't feel strongly about `__method` vs. `_method`, but I did actually follow advice from some Python coding resources online about encapsulation. The one I linked above isn't the one I used, but it's comparable, and I think it makes a decent argument. It'll always be a problem when encapsulation is enforced by naming convention and not by code, though.
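For reference, a small sketch of the practical difference between the two spellings. The class names `Base` and `Child` are made up for the example and are not from this PR:

```python
class Base(object):
    def _single(self):
        return "base single"

    def __double(self):
        # Stored on the class as _Base__double (name mangling).
        return "base double"

    def call_both(self):
        # self.__double here is compiled as self._Base__double.
        return self._single(), self.__double()


class Child(Base):
    def _single(self):
        # Overrides Base._single as usual.
        return "child single"

    def __double(self):
        # Stored as _Child__double, so it does NOT override Base's method.
        return "child double"


result = Child().call_both()            # ('child single', 'base double')
still_reachable = Base()._Base__double()  # mangling hides the name, it does not lock it
```

So `__method` protects against accidental overrides in subclasses, but anyone who spells out the mangled name can still call it, which is the "convention, not code" point.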