inspire-next
Port existing journal workflows to Holding Pen
The existing journal workflows need to be ported to use the new hepcrawl service based on scrapy. Scheduled and one-shot harvests can be made by triggering them via appropriate Celery tasks, which integrate with the scrapyd service. The results are then pushed back to INSPIRE when crawling is completed (or ends in an error state) and the appropriate HEP ingestion workflows are launched (WIP in #730).
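As a rough illustration of the triggering step, a harvest can be scheduled through scrapyd's `schedule.json` HTTP endpoint, which takes a `project` and a `spider` plus optional spider arguments. The sketch below only builds the request and is not the actual INSPIRE Celery task: the host and port, the spider name `arXiv`, and the `from_date` argument are assumptions for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: scrapyd is reachable on its default port on localhost.
SCRAPYD_URL = "http://localhost:6800"


def build_schedule_request(spider, project="hepcrawl",
                           scrapyd_url=SCRAPYD_URL, **spider_args):
    """Build a POST request for scrapyd's schedule.json endpoint.

    Extra keyword arguments are passed through as spider arguments.
    The project and spider names are hypothetical placeholders.
    """
    params = {"project": project, "spider": spider}
    params.update(spider_args)
    data = urlencode(params).encode("utf-8")
    url = scrapyd_url.rstrip("/") + "/schedule.json"
    return Request(url, data=data, method="POST")


# Example: a one-shot harvest request (built here, not sent).
request = build_schedule_request("arXiv", from_date="2016-06-01")
```

A Celery task could wrap this function and send the request with `urllib.request.urlopen(request)`; a periodic schedule would then cover the scheduled harvests and a direct call the one-shot ones.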
- [x] Unify harvesting workflow into one "article" workflow for all sources by modifying `arxiv_harvest_math` to be generic (unified to `hep_ingestion`).
- [x] Add support for easy source filtering in new interface for HEP ingestion workflows
- [x] Add pipeline to INSPIRE rabbitmq server in `hepcrawl`
- [x] Implement initial spiders in `hepcrawl`:
  - [x] World Scientific
  - [x] arXiv
  - [x] Elsevier
  - [x] APS
  - [x] PoS
- [ ] Implement remaining spiders in `hepcrawl`
  - [x] IOP
  - [ ] Springer (under development by @fschwenn)
  - [x] Hindawi
  - [ ] EDPSciences ++
Edit: Updated June 2016
@jalavik Can you provide a status update of this task?
Stupid question: APS is ticked. Does that mean it is considered 'completed'? Somehow I cannot see how it handles references.
@fschwenn You are right. Not all of them are 100% complete yet, but each one at least exists, produces some metadata and integrates with its source, which is why I ticked it off. Many of the crawlers probably need more love to be 100% good. We should file missing features as issues in the hepcrawl GitHub repository.
Regarding references, I see now that their new JSON API does not seem to include them (http://harvest.aps.org/docs/harvest-api#general), so we need to additionally request and parse the XML format for this info:
curl -H 'Accept: text/xml' http://harvest.aps.org/v2/journals/articles/10.1103/PhysRevSTAB.4.072801
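The same request could be made from Python, with the references then pulled out of the returned XML. This is only a sketch: it assumes the XML is JATS-like and wraps each reference in a `<ref>` element, which has not been checked against the real APS output, and the helper names are made up for the example.

```python
import xml.etree.ElementTree as ET
from urllib.request import Request

APS_ARTICLE_URL = "http://harvest.aps.org/v2/journals/articles/{doi}"


def build_xml_request(doi):
    """Build a request asking APS for the XML representation of an article."""
    return Request(APS_ARTICLE_URL.format(doi=doi),
                   headers={"Accept": "text/xml"})


def extract_reference_texts(xml_text):
    """Return the flattened text of every <ref> element.

    Assumption: references are stored as JATS-style <ref> elements.
    """
    root = ET.fromstring(xml_text)
    return [" ".join("".join(ref.itertext()).split())
            for ref in root.iter("ref")]


# Hand-written sample for illustration only, not real APS output.
SAMPLE = (
    "<article><back><ref-list>"
    "<ref id='c1'><mixed-citation>A. Author, "
    "<source>Phys. Rev.</source> 1, 2 (2000)</mixed-citation></ref>"
    "</ref-list></back></article>"
)
references = extract_reference_texts(SAMPLE)
```

The request itself would be sent with `urllib.request.urlopen(build_xml_request(doi))`, and the decoded response body passed to `extract_reference_texts`.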
@inspirehep/inspire-dir Maybe asking publishers to provide references through their APIs could be a topic for the next AAHEP?
@bittirousku @david-caro @fschwenn do you know the status of Springer and EDPSciences ++? Are there open PRs that can be linked to this issue?
Florian is working on the Springer crawler. We get references in the feeds. It's mainly IOP that doesn't provide references.
We have inspirehep/hepcrawl#43 for the EDP one; I'm not sure about the Springer one.