hepcrawl
DESY FTP
During the INSPIRE Week it has been agreed that DESY would make available through FTP the different feeds that are then loaded into INSPIRE.
I'd propose that the FTP be divided into one directory per feed.
@ksachs @fschwenn can you detail which feeds you would actually put there? I guess a spider per feed will need to be written, correct?
Sorry - misunderstanding.
For Elsevier, World Scientific, APS, PoS the publisher data are currently harvested at CERN. I don't know whether on legacy or labs. After the conversion CERN deposits INSPIRE-xml on the DESY FTP server and sends an email to [email protected]. We need the DESY FTP server only as long as we do the matching/selection/merging via the DESY workflow.
Springer serves their data on their own FTP server (ftp.springer-dds.com), so there is no need to copy it to DESY once the harvesting is done at CERN.
PTEP and Acta Physica Polonica B send emails with attachments. Is there a possibility at CERN to feed email attachments to a HEPcrawl spider?
Other emails are only alerts to trigger a web-crawl program. Again it would be nice if an email could trigger a HEPcrawl spider. For now we just process these journals at DESY. We don't have HEPcrawl spiders for those anyhow.
I think the easiest thing would be for you to indeed store those attachments in a shared space such as the mentioned DESY FTP server.
For the triggers... Mmh... So, hepcrawl has indeed an interface to trigger a crawl, @david-caro might provide more information about it. Basically you could then send an HTTP POST request to hepcrawl to trigger the harvesting of the corresponding journal.
http://pythonhosted.org/hepcrawl/operations.html#schedule-crawls
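For illustration, such a trigger could look roughly like this (a minimal sketch assuming a scrapyd-style schedule.json endpoint as in the docs above; host, port and the spider name are placeholders):

    import requests

    # Placeholders: the actual host/port and spider name depend on the deployment.
    response = requests.post(
        'http://localhost:6800/schedule.json',
        data={'project': 'hepcrawl', 'spider': 'ptep'},  # hypothetical spider for the journal to harvest
    )
    response.raise_for_status()
    print(response.json())  # e.g. {"status": "ok", "jobid": "..."}

So an incoming alert email would only need to fire off such a request for the corresponding journal.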
Last week we agreed to create a simple interface to allow hepcrawl to harvest marcxml records from DESY; that way we are not hurried by the legacy shutdown into implementing any DESY-side flows, and that can be done calmly, bit by bit.
So in order to bootstrap that conversation, I propose to add a folder on the DESY FTP with the records to harvest, and hepcrawl will pick them up periodically.
The records should be separated in subfolders by source, so hepcrawl knows where they originally come from (springer, elsevier...).
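For illustration, the layout could look something like this (directory and file names are made up):

    desy-ftp/
        springer/
            2017-05-30.xml
            2017-05-31.xml
        elsevier/
            2017-05-30.xml
        ...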
What do you think?
Creating a subfolder on the DESY FTP server where CERN can pick up marcxml to feed hepcrawl is a very good idea.
But why does hepcrawl need to know where they came from? It is converted INSPIRE marcxml. Instead of 50 different subfolders it might be easier to add that info to the metadata if necessary. E.g. for the abstract we add the source (=publisher) to 520__9 anyhow.
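For illustration, such a source mark on the abstract could look like this (publisher and text made up):

    <datafield tag="520" ind1=" " ind2=" ">
      <subfield code="a">We present ...</subfield>
      <subfield code="9">Elsevier</subfield>
    </datafield>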
It needs the source of the crawl for various reasons:
- In order to display it properly in the holding pen, sort/search/facets/...
- So we can properly match it with the last update from that source (not yet there, but will be needed).
- Tracking purposes: a record coming from a spider that crawls publisher A directly is not the same as one coming from DESY, even if both originally came from the same publisher A.
But yes, having it in the metadata somehow might be enough; I just proposed the directory structure for easy organization and implementation (50 dirs is not that many, and it makes it easy to see whether any provider source is empty or not being crawled properly, whereas putting it only in the metadata means checking the contents of the files every time you want to know something similar).
The key point being, we need a stable and reliable way of knowing the origin of the record.
The origin of the record is 'DESY':
- for display, the journal might be more useful, with 'DESY' or the publisher (if it is in the metadata) as fall-back
- matching: only relevant when the data come directly from the publisher, e.g. the springer crawler
- for tracking purposes the source is DESY, the rest is our (=DESY local) problem including the question whether a publisher got 'stuck'.
This workflow via DESY can be a short-term solution for the bigger publishers. Only for the small and infrequent publishers will we need it for a longer period. There it doesn't help to know that a folder is still empty; that might well be correct. Florian and I would suggest leaving the responsibility for whether the harvest/conversion went fine with DESY and just processing what is in the metadata.
Ideally it would be great to have the real source (i.e. the name of the publisher) so that later, when a crawler is ported from DESY to INSPIRE, it is possible to compare apples with apples. As you might remember, in order to implement the automatic merging of a record update we need to fetch the last version for the corresponding source of the record that is being manipulated. If all the sources read DESY, then you need to guarantee that you won't ever have the same publication coming through 2 separate sources that are then masked as DESY when they arrive at INSPIRE.
But why does hepcrawl need to know where they came from? It is converted INSPIRE marcxml. Instead of 50 different subfolders it might be easier to add that info to the metadata if necessary. E.g. for the abstract we add the source (=publisher) to 520__9 anyhow.
@david-caro I think this should be good enough also for hepcrawl indeed to guess the source. After all the source doesn't need to be associated with one and only one hepcrawl-spider.
Then how do we differentiate desy ones from non-desy ones?
don't mix source (way to harvest) and publisher (metadata)
@kaplun Wrt. source: you don't have that info for 1M records in INSPIRE. For big publishers the DESY-spider workaround is a short(!!!)-term temporary solution. Don't make it perfect. For small publishers - that's peanuts. We don't need to compare to previous version. In any case: it's DESY spider + DOI you can compare to.
@david-caro
desy-spider -> source=DESY, publisher = whatever is in the metadata
other spider -> non-desy
@ksachs in inspire-schema we call source the origin of truth, i.e. the publisher. How things reach us has a sort of lesser importance, and it goes into acquisition_source.
@kaplun Wrt. source: you don't have that info for 1M records in INSPIRE.
Sure, but anyway we should start from somewhere, and updates from publishers will most often be about papers that reached us within the last year as preprints. So if we start to have clear data from now onwards, we will be in regime in one year (i.e. much less pain for catalogers due to unresolved conflicts caused by missing/untraceable history).
Maybe we are not talking about the same thing; a video meeting might be helpful. For arXiv: do you want to compare to another arXiv version or to the update that comes from the publisher? For most preprints we don't get the publisher info from arXiv. If we do, it can be the publisher or the journal.
Is there a show-stopper if you just convert the marc to json as for existing INSPIRE records + acquisition_source = DESY?
Ok, so in the end, the acquisition_source for records that are harvested by the desy spider will be:
"acquisition_source": {
"method": "hepcrawl",
"source": "desy"
}
And the data of the record will be exactly whatever is passed from desy (the output of dojson on the xml).
Does anyone disagree?
And, the topic of this issue: the ftp will just be a folder with individual xml files, one per record, which will be removed upon ingestion (I recommend moving them to a temporary dir that gets cleaned up periodically, though that should probably be done on the server side if you want it, just in case we want to rerun anything).
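For the pickup itself, something along these lines could work (a rough sketch with ftplib; host, credentials and directory names are made up, and the move-aside to a processed dir is just the suggestion above, not an agreed convention):

    import ftplib

    FTP_HOST = 'ftp.example.desy.de'               # placeholder host
    SOURCE_DIR = '/inspire/springer'               # placeholder per-source folder
    PROCESSED_DIR = '/inspire/springer/processed'  # placeholder dir to be cleaned up later

    ftp = ftplib.FTP(FTP_HOST)
    ftp.login()  # real credentials would go here
    ftp.cwd(SOURCE_DIR)

    for name in ftp.nlst():
        if not name.endswith('.xml'):
            continue
        # Download the record file locally for ingestion.
        with open(name, 'wb') as local_file:
            ftp.retrbinary('RETR ' + name, local_file.write)
        # Move it aside instead of deleting it, so a crawl can be rerun if needed.
        ftp.rename(name, PROCESSED_DIR + '/' + name)

    ftp.quit()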
I am not sure one XML file per record is the easiest on the DESY side. What about the possibility of grouping multiple records in one MARCXML file? (Normally multiple MARCXML records are grouped into a <collection> ... </collection>.)
Right, it would be easier if we could pass on collections of records in a file.
Hmm, then in order to parse them we would have to iterate over each record in every file... That might be messy on the scrapy side.
If needed, we can split the xml also on DESY side - no problem.
No need, we can do it on our side :), thanks!
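For the splitting, a minimal sketch of what we would do on our side (lxml-based; the namespace handling and function name are assumptions):

    from lxml import etree

    MARC_NS = 'http://www.loc.gov/MARC21/slim'  # assumption: MARCXML slim namespace

    def split_collection(path):
        """Yield each <record> of a <collection> as its own MARCXML string."""
        tree = etree.parse(path)
        # Handle both namespaced and plain MARCXML, just in case.
        records = tree.findall('.//{%s}record' % MARC_NS) or tree.findall('.//record')
        for record in records:
            yield etree.tostring(record, encoding='unicode')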
Another question: the marcxml files you provide will have files attached to them, right? If so, what paths will they have (so we can download them)? @ksachs @fschwenn ^
The publishers from which we get fulltexts will run via HEPCrawl. For all those smaller publishers for which we need the DESYmarcxmlSpider, the only fulltexts are OA ones, for which the xml would contain a weblink.
There will be an overlapping time where some big publishers will still run on desy (springer for example), so we should support those too, right?