hepcrawl: add crawler for OSTI

Open tsgit opened this issue 5 years ago • 2 comments

* use API at OSTI to harvest records associated with SLAC

Signed-off-by: Thorsten Schwander [email protected]

Description

This adds a LastRunSpider to crawl OSTI for records with SLAC association. The purpose is to satisfy an institutional mandate of having all SLAC HEP research represented in Inspire. Not all SLAC research output is on arXiv or other customarily harvested channels. OSTI is an additional channel to check.

Related Issue

Motivation and Context

Checklist:

  • [x] I have all the information that I need (if not, move to RFC and look for it).
  • [ ] I linked the related issue(s) in the corresponding commit logs.
  • [x] I wrote good commit log messages.
  • [ ] My code follows the code style of this project.
  • [ ] I've added any new docs if API/utils methods were added.
  • [ ] I have updated the existing documentation accordingly.
  • [x] I have added tests to cover my changes.
  • [x] All new and existing tests passed.

tsgit avatar Sep 16 '19 03:09 tsgit

very good comments @michamos thanks

tsgit avatar Sep 16 '19 23:09 tsgit

Right, I agree that schema_utils shouldn't deal with encoding issues, which means the crawler will have to do some sanitizing of its input. The remote end doesn't serve metadata in a consistent encoding; it's random crap, so the crawler should understand the quirks of its source.
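To illustrate the kind of crawler-side sanitizing meant here, a minimal sketch follows. It is not the code in this PR: `sanitize_text` is a hypothetical helper, and the UTF-8-then-Latin-1 fallback is only an assumption about what inconsistent remote encodings might look like.

```python
import unicodedata


def sanitize_text(raw):
    """Return cleaned-up text from a raw metadata value (hypothetical helper).

    Assumption: values arrive either as text or as bytes that are
    usually UTF-8 but occasionally Latin-1.
    """
    if isinstance(raw, bytes):
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Latin-1 maps every byte to a character, so this cannot fail
            text = raw.decode("latin-1")
    else:
        text = raw
    # Compose combining sequences so equal-looking strings compare equal
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace into single spaces and trim the ends
    return " ".join(text.split())
```

Keeping this in the spider means schema_utils only ever sees clean text, which matches the split of responsibilities discussed above.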

On the other hand, you advocate for collaboration splitting and normalization in the utils, but then there is no deduping!? If the input data has "Virgo collaboration; Ligo collaboration; Virgo and Ligo collaborations", then the collaborations end up duplicated.

So I think LiteratureBuilder should ensure deduping of lists like collection and collaborations among others. That's beyond this PR, though.
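A rough sketch of the splitting-plus-deduping idea, using the example input above. Both function names are hypothetical and this is not how LiteratureBuilder or schema_utils currently work; the splitting rules (semicolons, " and ", a trailing "collaboration(s)") are assumptions chosen to fit that one example.

```python
import re


def split_collaborations(raw):
    """Split a raw collaboration string into bare names (hypothetical helper)."""
    names = []
    for chunk in raw.split(";"):
        chunk = chunk.strip()
        # Drop a trailing "collaboration" or "collaborations"
        chunk = re.sub(r"\s+collaborations?$", "", chunk, flags=re.IGNORECASE)
        # "Virgo and Ligo" becomes two separate names
        for part in re.split(r"\s+and\s+", chunk):
            part = part.strip()
            if part:
                names.append(part)
    return names


def dedupe(items):
    """Remove case-insensitive duplicates, keeping the first occurrence and the order."""
    seen = set()
    result = []
    for item in items:
        key = item.lower()
        if key not in seen:
            seen.add(key)
            result.append(item)
    return result


raw = "Virgo collaboration; Ligo collaboration; Virgo and Ligo collaborations"
collaborations = dedupe(split_collaborations(raw))
```

Without the `dedupe` step the same input yields four entries, which is the replication described above.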

I don't feel strongly about __method vs. _method, but I did actually follow advice from some Python coding resources online about encapsulation. The one I linked above isn't the one I used, but it's comparable, and I think it makes a decent argument. It'll always be a problem when encapsulation is enforced by naming convention and not by code, though.
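For reference, the practical difference between the two spellings, as a small sketch. `ExampleSpider` is a made-up class, not the spider in this PR; the name mangling it shows is standard Python behavior.

```python
class ExampleSpider(object):
    """Made-up class to compare the two naming conventions."""

    def _single(self):
        # Single underscore: private by convention only
        return "single"

    def __double(self):
        # Double underscore: the name is mangled to _ExampleSpider__double
        return "double"

    def run(self):
        # Inside the class body both spellings resolve normally
        return self._single(), self.__double()


spider = ExampleSpider()
inside = spider.run()
by_convention = spider._single()               # works, just discouraged
hidden = not hasattr(spider, "__double")       # the unmangled name does not exist
still_reachable = spider._ExampleSpider__double()  # the mangled name does
```

So the double underscore makes accidental access and accidental overriding in subclasses harder, but it is still only a renaming, not enforcement.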

tsgit avatar Sep 18 '19 04:09 tsgit