Port existing journal workflows to Holding Pen

Open jalavik opened this issue 9 years ago • 7 comments

The existing journal workflows need to be ported to use the new hepcrawl service, which is based on Scrapy. Scheduled and one-shot harvests can be triggered via appropriate Celery tasks that integrate with the scrapyd service. The results are then pushed back to INSPIRE when crawling completes (or ends in an error state), and the appropriate HEP ingestion workflows are launched (WIP in #730).

  • [x] Unify harvesting workflow into one "article" workflow for all sources by modifying arxiv_harvest_math to be generic (unified to hep_ingestion).
  • [x] Add support for easy source filtering in new interface for HEP ingestion workflows
  • [x] Add pipeline to INSPIRE rabbitmq server in hepcrawl
  • [x] Implement initial spiders in hepcrawl:
      • [x] World Scientific
      • [x] arXiv
      • [x] Elsevier
      • [x] APS
      • [x] PoS
  • [ ] Implement remaining spiders in hepcrawl:
      • [x] IOP
      • [ ] Springer (under development by @fschwenn)
      • [x] Hindawi
      • [ ] EDPSciences ++
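
As a rough illustration of the harvest trigger described above, a harvest could be scheduled through scrapyd's `schedule.json` HTTP endpoint. This is only a minimal sketch, not the actual inspire-next task: the scrapyd URL, the project name `hepcrawl`, and the helper names `build_schedule_request` and `schedule_harvest` are assumptions made for illustration, and in practice the call would presumably be wrapped in a Celery task.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: scrapyd runs locally on its default port.
SCRAPYD_URL = "http://localhost:6800"


def build_schedule_request(spider, project="hepcrawl", **spider_args):
    """Build the POST request for scrapyd's schedule.json endpoint.

    Extra keyword arguments are forwarded to the spider as arguments.
    The project name "hepcrawl" is an assumption, not a confirmed setting.
    """
    payload = {"project": project, "spider": spider}
    payload.update(spider_args)
    data = urlencode(payload).encode("utf-8")
    return Request(SCRAPYD_URL + "/schedule.json", data=data, method="POST")


def schedule_harvest(spider, **spider_args):
    """Send the request and return scrapyd's decoded JSON reply."""
    with urlopen(build_schedule_request(spider, **spider_args)) as response:
        return json.loads(response.read().decode("utf-8"))
```

A one-shot harvest would then be something like `schedule_harvest("arXiv")`, where the spider name is hypothetical and must match whatever name the spider is registered under in hepcrawl.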

Edit: Updated June 2016

jalavik avatar Jul 14 '15 10:07 jalavik

@jalavik Can you provide a status update of this task?

kaplun avatar Feb 25 '16 13:02 kaplun

Stupid question: APS is ticked. Does that mean it is considered 'completed'? I cannot see how it handles references.

fschwenn avatar Jun 21 '16 13:06 fschwenn

@fschwenn You are right. Not all of them are 100% complete yet, but each ticked spider at least exists, produces some metadata, and integrates with its source, which is why I ticked it off. Many of the crawlers probably need more love to be 100% good. We should file missing features as issues on the hepcrawl GitHub repository.

Regarding references, I see now that their new JSON API does not seem to include them (http://harvest.aps.org/docs/harvest-api#general), so we need to additionally request and parse the XML format for this info:

curl -H 'Accept: text/xml' http://harvest.aps.org/v2/journals/articles/10.1103/PhysRevSTAB.4.072801
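
The same request could be made from Python inside a spider or helper. The following is a minimal sketch using only the standard library; the function names `build_xml_request` and `fetch_article_xml` are made up for illustration, and nothing is assumed here about which XML elements actually carry the references.

```python
import xml.etree.ElementTree as ET
from urllib.request import Request, urlopen

# Base URL taken from the curl command above.
APS_ARTICLES_URL = "http://harvest.aps.org/v2/journals/articles/"


def build_xml_request(doi):
    """Build a GET request asking APS for the XML version of an article."""
    return Request(APS_ARTICLES_URL + doi, headers={"Accept": "text/xml"})


def fetch_article_xml(doi):
    """Fetch the article and return the parsed XML root element."""
    with urlopen(build_xml_request(doi)) as response:
        return ET.fromstring(response.read())
```

Calling `fetch_article_xml("10.1103/PhysRevSTAB.4.072801")` would return the root element, which then has to be inspected to find where the references are stored.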

jalavik avatar Jun 21 '16 13:06 jalavik

@inspirehep/inspire-dir Maybe asking publishers to provide references through their APIs could be a topic for the next AAHEP?

kaplun avatar Aug 10 '16 11:08 kaplun

@bittirousku @david-caro @fschwenn do you know the status of Springer and EDPSciences ++? Are there open PRs that can be linked to this issue?

kaplun avatar Aug 10 '16 11:08 kaplun

Florian is working on the Springer Crawler. We get references in the feeds. It's mainly IOP that doesn't provide references.

ksachs avatar Aug 10 '16 12:08 ksachs

We have inspirehep/hepcrawl#43 for the EDP one; I'm not sure about the Springer one.

david-caro avatar Aug 10 '16 12:08 david-caro